
Research And Implementation Of Multilingual Recognition Technology In Machine-printed Mongolian Documents

Posted on: 2018-04-24  Degree: Master  Type: Thesis
Country: China  Candidate: L M Yang  Full Text: PDF
GTID: 2348330515955333  Subject: Software engineering
Abstract/Summary:
At present, many character recognition systems can recognize only a single language. However, with the trend toward global integration, several different languages often appear in one document. Many Mongolian documents contain not only Mongolian words but also a certain number of Chinese characters and English words, so it is essential to design a multilingual recognition system. This dissertation proposes a multilingual recognition technology that is divided into two steps: pre-processing and script identification.

The pre-processing procedure is as follows. First, text regions and non-text regions are separated from each other. Then the text regions are divided into paragraphs. Next, vertical projection and Gaussian smoothing are used to segment each paragraph into columns. Finally, a method based on connected component analysis is used to obtain individual word images. Throughout this procedure, the coordinates of the word images in the original document image are preserved so that the layout can be reconstructed.

The multilingual recognition method presented in this dissertation includes a coarse classifier and a fine classifier. The coarse classifier assigns word images to Mongolian, Chinese, or English according to their widths, heights, and other information. Among the word images assigned to Chinese, a certain number of English and Mongolian words may resemble Chinese characters; among those assigned to English, a certain number of Chinese characters and Mongolian words may resemble English words. A fine classifier is therefore applied separately to the Chinese case and the English case. In the fine classifier, convolutional neural networks (CNNs) are used to distinguish the classes, including Chinese characters, English words, and punctuation marks.

On a testing dataset, the accuracy rates of column segmentation and word segmentation are 99.13% and 97.87%, respectively. For the fine classifier, the average recognition accuracy rate is 99.41% for Chinese characters, 98.86% for English, and 98.34% for punctuation marks.
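
The column segmentation step mentioned in the abstract (vertical projection followed by Gaussian smoothing) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the binary list-of-rows image format, the `sigma` value, and the `ratio` cut-off rule are all assumptions made for illustration. The idea is to count ink pixels in each pixel column, smooth that profile so small gaps inside a text column do not split it, and treat runs above a cut-off as text columns.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel with a radius of about 3 * sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma):
    """Convolve a 1-D profile with a Gaussian, replicating the edge values."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    smoothed = []
    for x in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(x + k - radius, 0), n - 1)  # clamp index at the borders
            acc += w * profile[j]
        smoothed.append(acc)
    return smoothed


def vertical_projection(image):
    """Count ink pixels in each pixel column; image is rows of 0/1 values."""
    return [sum(column) for column in zip(*image)]


def segment_columns(image, sigma=1.0, ratio=0.2):
    """Return (start, end) x-ranges, end exclusive, of detected text columns.

    `ratio` is a hypothetical cut-off expressed as a fraction of the
    smoothed profile's peak; the source does not state how valleys are chosen.
    """
    profile = smooth(vertical_projection(image), sigma)
    peak = max(profile, default=0.0)
    if peak == 0:
        return []
    cut = ratio * peak
    spans, start = [], None
    for x, value in enumerate(profile):
        if value > cut and start is None:
            start = x                      # entering a text column
        elif value <= cut and start is not None:
            spans.append((start, x))       # leaving a text column
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans
```

Because Mongolian script runs in vertical columns, the projection here is taken along the x axis. A larger `sigma` merges narrow gaps more aggressively, so in practice it would have to be tuned against the expected column spacing.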
Keywords/Search Tags: Mongolian, Document Images, Pre-processing, Script Identification, Convolutional Neural Networks