Research And Implementation Of Layout Analysis And Post-processing For Mongolian Document Images

Posted on: 2018-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Wang
Full Text: PDF
GTID: 2348330515955347
Subject: Computer technology
Abstract/Summary:
In recent years, research on Optical Character Recognition (OCR) technology has developed rapidly, and Chinese and English character recognition have achieved remarkable results. Recognition accuracy is the most important performance metric for a character recognition system. For a machine-printed Mongolian character recognition system, layout analysis and post-processing must be studied and implemented in order to improve performance. Accordingly, the main research work in this dissertation covers two aspects: (1) layout analysis of Mongolian document images; (2) post-processing of Mongolian character recognition.

Layout analysis is an important pre-processing step in machine-printed Mongolian character recognition, but there are few studies on layout analysis for Mongolian document images. In addition, the layouts of Mongolian document images vary widely. Some complex layouts contain text, pictures and tables, which makes layout analysis a very difficult task. This dissertation proposes a layout analysis method that combines a bottom-up scheme with a top-down scheme. Specifically, all connected components are first extracted and then merged according to a set of rules. Non-text connected components are then removed, and the remaining connected components are segmented into paragraphs. Throughout this procedure, the coordinates of each paragraph are retained so that they are available at the layout reconstruction stage.

At present, the Mongolian character recognition system encodes its segmentation and recognition results as glyphs. A post-processing step is used to obtain the international standard encoding: the recognition results are converted with a conversion dictionary that maps the glyph encoding scheme to the international standard encoding scheme. To generate the conversion dictionary, 50553 Mongolian words in the international standard encoding are copied into a Word document, from which a PDF document is generated. Each page of the PDF document is then saved as an image, and each image is recognized by the Mongolian character recognition system. The conversion dictionary is built from these results and is then used for post-processing.

The proposed layout analysis approach can handle a variety of complex layouts in Mongolian document images, and its accuracy reaches 97.87% on a testing set. The proposed post-processing approach converts the recognition results from the glyph encoding scheme into the international standard encoding scheme, which makes the machine-printed Mongolian character recognition system more practical.
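
The abstract does not give the merging rules, the non-text criteria, or the dictionary format, so the following Python sketch is only an illustration under stated assumptions, not the dissertation's implementation. It shows the bottom-up part of the layout analysis (connected-component extraction, merging of nearby components, removal of oversized components) and a dictionary lookup for the encoding conversion. The distance threshold `gap`, the size limit `max_side`, the glyph identifiers such as "g01", and all function names are hypothetical.

```python
from collections import deque


def connected_components(img):
    """Return bounding boxes (top, left, bottom, right) of the
    8-connected foreground regions of a binary image (list of rows)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def is_close(a, b, gap):
    """True if the nearest edges of boxes a and b differ by at most
    `gap` in pixel coordinates on both axes (overlap counts as close)."""
    dy = max(a[0], b[0]) - min(a[2], b[2])
    dx = max(a[1], b[1]) - min(a[3], b[3])
    return dy <= gap and dx <= gap


def merge_boxes(boxes, gap):
    """Merge pairs of nearby boxes until no pair is close.
    Assumed rule: the thesis only says 'according to some rules'."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if is_close(a, b, gap):
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes


def drop_non_text(boxes, max_side):
    """Assumed non-text filter: discard boxes whose height or width
    exceeds `max_side` pixels (for example pictures or table frames)."""
    return [b for b in boxes
            if b[2] - b[0] + 1 <= max_side and b[3] - b[1] + 1 <= max_side]


def convert_words(glyph_words, table):
    """Map each glyph-code sequence to its standard-encoding string;
    unknown sequences become None so that they can be flagged."""
    return [table.get(tuple(word)) for word in glyph_words]
```

In this sketch the retained bounding boxes play the role of the coordinates that the abstract says are kept for layout reconstruction, and `table` stands in for the conversion dictionary, keyed by glyph sequence with a string of Mongolian code points as the value.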
Keywords/Search Tags: Mongolian character recognition, Mongolian document images, layout analysis, post-processing, encoding conversion