Historical Mongolian Document Recognition Based On Deep Learning And Knowledge Strategies

Posted on: 2017-02-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X D Su
GTID: 1108330485966601
Subject: Computer application technology
Abstract/Summary:
Historical Mongolian documents are valuable and reliable resources for research on Mongolian history. The library of Inner Mongolia University initiated a digitization program to save, collect, and preserve these documents. In the program, the documents were converted into images, which are accessible through the website. However, it is inconvenient to edit, retrieve, and mine these document images. The solution is to transform the images into text through optical character recognition techniques. Historical Mongolian documents were printed with woodblock printing techniques, so their layouts are not strictly regular, word variations are obvious, and ink spreading is common. Over time, ink fading and flaking also occurred. All of these pose challenges for document recognition. In this dissertation, we investigated historical Mongolian document recognition, using images of the Mongolian Kanjur as the object of study. The main contributions are as follows:

1. This dissertation proposed an effective method for recognizing the words in historical Mongolian documents. We analyzed the advantages and limitations of both holistic recognition and segmentation-based recognition. On this basis, we proposed a hybrid strategy that recognizes historical Mongolian words according to their characteristics. Several types of words that cannot be segmented into glyph-units were recognized using the holistic scheme, and the remaining words, which consist of several letters, were recognized using the segmentation-based scheme. We determined the criterion for scheme selection through experiments.

2. This dissertation proposed a semi-automatic method for training sample selection. A large number of training samples are required to train the convolutional neural networks used in word recognition. Since manually selecting training samples is time-consuming, we put forward a semi-automatic method for training sample selection. This method first classified each word in the unclassified training sample set into a category, after which the erroneous samples in each category were removed manually. The remaining words in each category were used as training samples. To train the classifier used in sample selection, three strategies were used: writer adaptation, generation of pseudo samples through morphological operations, and repeated training.

3. This dissertation proposed a method of glyph-unit segmentation. Although there are many segmentation methods for machine-printed Mongolian words, they do not apply to the words in historical Mongolian documents, because these words show obvious variations and overlapping glyph-units. We exploited knowledge of the glyph characteristics and proposed a glyph-unit segmentation method based on contour analysis. First, the method detects significant contour points. Second, it uses these points to facilitate baseline localization. Finally, it generates candidate segmentation paths from the significant contour points and the baselines. Significant contour points were extracted from the approximating polygons of the external contours of the words, which simplifies the extraction process and limits the disturbance and false detections caused by noise on the contours.

4. This dissertation proposed three knowledge-based strategies to improve word accuracy. In the segmentation-based scheme, the recognition results of glyph-units were used to produce the recognition results of words. Analysis of the recognition results indicates two main causes of word-level errors in segmentation-based recognition. The first is segmentation errors resulting from the segmentation algorithm, and the second is recognition errors resulting from the classifier in glyph-unit recognition. Three strategies were proposed to improve word accuracy: incorporating baseline information (IBI), glyph-unit grouping (GG), and recognizing under-segmented and over-segmented fragments (RUOF).

5. This dissertation proposed a method for building a Classical Mongolian dictionary. We used the dictionary and the glyph-unit neighboring rules derived from Mongolian syllable structures to detect recognition errors, and evaluated the performance of each. We adopted a weighted edit distance model and the noisy channel model for error correction, and assigned the weights of edit operations according to the recognition results and the characteristics of glyph-units. In addition, we simplified the noisy channel model to reduce the computational cost, in line with our recognition method.
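
The weighted edit distance used for error correction can be illustrated with a minimal sketch. It is not the dissertation's implementation: the glyph-unit labels (`g1`, `g2`, ...), the confusion weight of 0.2, the unit insertion and deletion costs, and the function names `weighted_edit_distance` and `correct` are all hypothetical, and the noisy channel model is not included. The sketch only shows how a cheaper substitution cost for glyph-units that a classifier tends to confuse can steer the choice of dictionary entry.

```python
def weighted_edit_distance(source, target, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Edit distance between two glyph-unit sequences with per-pair substitution costs.

    sub_cost maps (recognized_unit, dictionary_unit) to a cost; pairs that are
    not listed fall back to a cost of 1.0 (an assumption made for this sketch).
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # First column: delete every unit of the source prefix.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost
    # First row: insert every unit of the target prefix.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            substitute = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,        # delete a
                dist[i][j - 1] + ins_cost,        # insert b
                dist[i - 1][j - 1] + substitute,  # match or substitute
            )
    return dist[-1][-1]


def correct(recognized, dictionary, sub_cost):
    """Return the dictionary entry with the smallest weighted distance."""
    return min(dictionary, key=lambda w: weighted_edit_distance(recognized, w, sub_cost))


# Hypothetical data: the classifier is assumed to confuse g2 with g7 often,
# so that substitution is made cheap.
confusions = {("g2", "g7"): 0.2}
dictionary = [("g1", "g4", "g3"), ("g1", "g7", "g3")]
recognized = ("g1", "g2", "g3")
best = correct(recognized, dictionary, confusions)
```

In this example both dictionary entries differ from the recognized sequence by one substitution, so an unweighted edit distance would not separate them. With the assumed confusion weight, the entry containing `g7` is preferred.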
Keywords/Search Tags: Historical Mongolian Document, Holistic Recognition, Segmentation-based Recognition, Convolution Neural Network, Knowledge-based Strategy, Error Correction