Research And Implementation Of Multilingual Recognition Technology In Machine-printed Mongolian Documents

Posted on:2018-04-24

Degree:Master

Type:Thesis

Country:China

Candidate:L M Yang

GTID:2348330515955333

Subject:Software engineering

Abstract/Summary:

At present, many character recognition systems can recognize only a single language. With global integration, however, several languages often appear in one document. Many Mongolian documents contain not only Mongolian words but also a certain number of Chinese characters and English words, so a multilingual recognition system is essential.

The multilingual recognition technology proposed in this thesis consists of two steps: pre-processing and script identification. Pre-processing proceeds as follows. First, the text regions are separated from the non-text regions. The text regions are then divided into paragraphs. Next, vertical projection and Gaussian smoothing are used to segment each paragraph into columns. Finally, a method based on connected component analysis is used to obtain individual word images. Throughout this procedure, the coordinates of the word images in the original document image are preserved so that the layout can be reconstructed.

The script identification method consists of a coarse classifier and a fine classifier. The coarse classifier assigns word images to Mongolian, Chinese, or English according to their widths, heights, and other information. Some English and Mongolian words resemble Chinese characters and may be assigned to the Chinese class; likewise, some Chinese characters and Mongolian words resemble English words and may be assigned to the English class. A fine classifier is therefore applied separately to the Chinese class and to the English class. The fine classifier uses convolutional neural networks (CNNs) to distinguish Chinese characters, English words, and punctuation marks.

On a test dataset, the accuracy rates of column segmentation and word segmentation are 99.13% and 97.87%, respectively. For the fine classifier, the average recognition accuracy rate is 99.41% for Chinese characters, 98.86% for English words, and 98.34% for punctuation marks.
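The column segmentation step (vertical projection followed by Gaussian smoothing) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the binary image representation (a list of rows with 1 for ink), and the values of `sigma` and `threshold` are all assumptions chosen for illustration. The idea is that the number of ink pixels in each pixel column forms a profile, smoothing suppresses small spurious gaps, and runs where the smoothed profile stays above a threshold are taken as text columns.

```python
import math


def gaussian_kernel(sigma):
    """Return a normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma):
    """Convolve the profile with a Gaussian; values outside the image count as 0."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    smoothed = []
    for x in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = x + k - radius
            if 0 <= j < n:
                acc += w * profile[j]
        smoothed.append(acc)
    return smoothed


def vertical_projection(image):
    """Count ink pixels (value 1) in each pixel column of a binary image."""
    width = len(image[0])
    return [sum(row[x] for row in image) for x in range(width)]


def segment_columns(image, sigma=1.0, threshold=0.5):
    """Return (start, end) x-ranges, end exclusive, of candidate text columns.

    sigma and threshold are hypothetical values, not taken from the thesis.
    """
    profile = smooth(vertical_projection(image), sigma)
    columns, start = [], None
    for x, value in enumerate(profile):
        if value > threshold and start is None:
            start = x                      # entering a text column
        elif value <= threshold and start is not None:
            columns.append((start, x))     # leaving a text column
            start = None
    if start is not None:
        columns.append((start, len(profile)))
    return columns


# Toy paragraph image: 10 rows, 20 pixel columns, ink at x = 2..5 and x = 12..16.
image = [[1 if 2 <= x <= 5 or 12 <= x <= 16 else 0 for x in range(20)]
         for _ in range(10)]
columns = segment_columns(image)
```

Because smoothing spreads each peak slightly, the returned ranges are a little wider than the ink itself; in practice `sigma` and `threshold` would need tuning to the scan resolution and the spacing between columns.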

Keywords/Search Tags:

Mongolian, Document Images, Pre-processing, Script Identification, Convolutional Neural Networks

Related items

1	Research On Techniques Of Script Identification Of Document Images
2	Research On Script Identification Of Printed Document Images
3	Research And Implementation Of Layout Analysis And Post-processing For Mongolian Document Images
4	Research On Script Identification Based On Texture Feature Of Document Images
5	Study On Script Identification Technology Based Central Asian Multi-lingual Document Images
6	Design And Research Of Testing Subsystem Of English CAI System Based On Mongolian Script
7	The Editor For The Mongolian Script Based On Unicode
8	Study On Script Identification Of Central Asian Print Document Image Based On Texture Features
9	Study On Script Identification Of Multi-script Document Image Based On Texture Features
10	Research On Script Identification In Text Images Based On Deep Learning