Research Of Mongolian Historical Document Recognition

Posted on: 2012-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: X D Su
Full Text: PDF
GTID: 2178330335972279
Subject: Computer Science and Technology
Abstract/Summary:
Many traditional Mongolian historical documents are currently preserved only as images. Their content covers religion, history, culture, art, astronomy, geography, nationality, medicine, and more, and they are a valuable part of human heritage. In image form, however, these documents are difficult for researchers to edit, search, and analyze statistically. To advance the digitization of Mongolian, this thesis studies Mongolian historical documents and proposes an efficient approach to recognizing them, using the Mongolian Sutra produced by imperial order as the research subject. This should make Mongolian historical documents easier to mine and use, and help spread Mongolian culture.

We first investigate the peculiarities of traditional Mongolian documents. In the preprocessing stage, we select methods for slant correction, binarization, and denoising that suit the characteristics of the historical documents. In the segmentation stage, we segment columns using the vertical projection and segment words using the Biggest Connected Component algorithm. We then segment each Mongolian word into several Glyph Units (GUs); each GU consists of at most three characters. In the feature extraction stage, we extract eight kinds of GU features: LP, Euler number, BD, DCT, DWT, PCA, Con&Pro, and EIP. In the classification stage, we use a three-step method to recognize the GUs. In the first step, a decision tree classifies all GUs into nine groups. In the second step, the GUs in each group are classified by five independent BP neural networks, each taking one of five kinds of GU features as input. In the last step, we combine the five classifier results for each GU group to produce the final recognition result. We then correct the recognition result with an algorithm based on weighted edit distance and generate the coded historical Mongolian documents.

The test collection is 20 pages of the Mongolian Sutra. The GU segmentation rate is 96.2%, and the recognition rate for Mongolian words is 71%. The documents were produced by many different xylographers, so the word shapes in them are not standardized, and intersecting characters are a significant problem; both make word recognition difficult. Under these conditions, the approach achieves the performance we aimed for.
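The post-correction step based on weighted edit distance could look roughly like the sketch below. This is an illustration only, not the thesis's implementation: the abstract does not give the weights, the lexicon, or the symbol encoding, so the function names (`weighted_edit_distance`, `correct_word`), the Latin placeholder symbols standing in for GU codes, and the confusion costs are all assumptions. The idea is that substituting two easily confused symbols costs less than an arbitrary substitution, so the nearest lexicon word reflects likely recognizer errors.

```python
def weighted_edit_distance(source, target, sub_cost=None,
                           ins_cost=1.0, del_cost=1.0):
    """Edit distance where substitution cost depends on the symbol pair.

    sub_cost maps (recognized_symbol, true_symbol) to a cost; pairs that
    are not listed fall back to a cost of 1.0 (an assumed default).
    """
    sub_cost = sub_cost or {}
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]

    # First column: delete every symbol of the source prefix.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost
    # First row: insert every symbol of the target prefix.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost

    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            substitution = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,          # drop a from source
                dist[i][j - 1] + ins_cost,          # add b from target
                dist[i - 1][j - 1] + substitution,  # match or substitute
            )
    return dist[-1][-1]


def correct_word(recognized, lexicon, sub_cost=None):
    """Return the lexicon entry closest to the recognized symbol sequence."""
    return min(lexicon,
               key=lambda word: weighted_edit_distance(recognized, word,
                                                       sub_cost))


# Hypothetical confusion cost: 'd' is often misread for 'c'.
confusions = {("d", "c"): 0.2}
best = correct_word("abd", ["abc", "abe", "xyz"], confusions)
```

With these made-up costs, `"abd"` is mapped to `"abc"` rather than `"abe"`, because the confusable pair is cheaper than a generic substitution. In practice the costs would presumably be estimated from the classifier's confusion statistics.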
Keywords/Search Tags: Mongolian Historical Document, GU Segmentation, Feature Extraction, Classifier Design, Result Combination