Research Of Mongolian Historical Document Recognition

Posted on:2012-02-05

Degree:Master

Type:Thesis

Country:China

Candidate:X D Su

GTID:2178330335972279

Subject:Computer Science and Technology

Abstract/Summary:

Many traditional Mongolian historical documents are currently preserved only as images. Their content covers religion, history, culture, art, astronomy, geography, ethnicity, medicine, and other fields, and they are a valuable part of human heritage. In image form, however, these documents are difficult for researchers to edit, search, and analyze statistically. To advance the digitization of Mongolian, this thesis studies Mongolian historical documents and proposes an efficient approach to recognizing them, using the Mongolian Sutra produced by imperial order as the research subject. The work is intended to make Mongolian historical documents easier to mine and use, and to promote the spread of Mongolian culture.

We investigate the peculiarities of traditional Mongolian documents and propose an approach to recognize them. In the preprocessing stage, we select methods for slant correction, binarization, and denoising that suit the characteristics of the historical documents. In the segmentation stage, we segment columns using the vertical projection and segment words using the Biggest Connected Component algorithm. Each Mongolian word is then segmented into several Glyph Units (GUs), each consisting of no more than three characters. In the feature extraction stage, we extract eight kinds of GU features: LP, Euler number, BD, DCT, DWT, PCA, Con&Pro, and EIP. In the classification stage, we recognize the GUs with a three-step method. First, all GUs are classified into nine groups by a decision tree. Second, the GUs in each group are classified individually by five independent BP neural networks whose inputs are five kinds of GU features. Third, the five results for each GU group from these classifiers are combined to give the final recognition result. We then correct the recognition result with an algorithm based on weighted edit distance and generate the coded historical Mongolian documents.

The test collection is 20 pages of the Mongolian Sutra. The GU segmentation rate is 96.2%, and the recognition rate for Mongolian words is 71%. Word recognition is a difficult task because the historical documents were produced by many xylographers, so the word forms are not standardized, and because intersecting characters are a significant problem. Under these conditions, the approach achieves the desired performance.
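
The abstract names weighted edit distance as the basis of the final correction step but gives no weights, lexicon, or matching rule. The following is therefore only a minimal Python sketch of the general technique, not the thesis's algorithm. The confusion table `CONFUSABLE`, the helper names `sub_cost` and `correct`, and the Latin placeholder strings are all hypothetical; in practice the sequences would be GU codes rather than letters.

```python
def weighted_edit_distance(source, target, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Dynamic-programming edit distance with a per-pair substitution cost."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # Turning a prefix of source into the empty sequence costs only deletions.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost
    # Turning the empty sequence into a prefix of target costs only insertions.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            substitution = 0.0 if a == b else sub_cost(a, b)
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,          # delete a
                dist[i][j - 1] + ins_cost,          # insert b
                dist[i - 1][j - 1] + substitution,  # match or substitute
            )
    return dist[-1][-1]


# Hypothetical confusion table (an assumption, not from the thesis):
# symbol pairs that a classifier often mixes up get a low substitution cost.
CONFUSABLE = {frozenset(("a", "e")): 0.3, frozenset(("o", "u")): 0.3}


def sub_cost(a, b):
    """Cost of replacing a with b; 1.0 unless the pair is listed as confusable."""
    return CONFUSABLE.get(frozenset((a, b)), 1.0)


def correct(recognized, lexicon):
    """Return the lexicon entry with the smallest weighted distance."""
    return min(lexicon, key=lambda word: weighted_edit_distance(recognized, word, sub_cost))


if __name__ == "__main__":
    print(correct("baga", ["bogo", "bege"]))
```

With plain unit costs, both candidate words would be equally far from the recognized string. The lower cost for confusable pairs is what lets the correction prefer the candidate that differs only by likely recognition errors.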

Keywords/Search Tags:

Mongolian Historical Document, GU Segmentation, Feature Extraction, Classifier Design, Result Combination

Related items

1	Research On Retrieval Of Historical Mongolian Document Images
2	Research On Visual Language Model For Historical Mongolian Document Images Retrieval
3	Historical Mongolian Document Recognition Based On Deep Learning And Knowledge Strategies
4	Research On Deep Learning For Historical Mongolian Document Images Retrieval
5	Research On Mongolian Lexical Analysis Based On Combination Of Statistical And Rule Approaches
6	Research Of 3D ROI Segmentation, Feature Extraction And Classification Methods For Pulmonary CAD
7	Residential Area Extraction From Remote Sensing Imagery Based On Multi-classifier Combination
8	Research On Character Extraction And Segmentation Of Chinese Historical Seal Images Based On Multiple Features
9	Multi-classifier Combination And Its Application In Medical Image Classification
10	The Study Of Mongolian Books On The Historical Theme Of Modern Publishing