
Research On Key Technology Of Mongolian Movable Type Newspaper Image Recognition

Posted on: 2023-03-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Lu
GTID: 1525306794486974
Subject: Computer application technology
Abstract/Summary:
The invention of Mongolian movable-type printing greatly influenced the development of Mongolian culture and education in China. Movable-type printing enabled large-scale mass production, which greatly reduced printing costs, shortened working time, and improved printing efficiency. However, owing to the scarcity of annotated image data and the immaturity of digitization technology for Mongolian printed newspapers, Mongolian printed newspapers and books are still preserved mainly on paper and as scanned images, which is not conducive to their long-term preservation and further development. How to transcribe documents, newspapers, and books into electronic text for long-term preservation and secondary mining has therefore become an urgent problem.

The digitization of Mongolian movable-type newspapers faces the following main challenges. First, the large amount of label noise introduced into the transcribed text of images during data collection reduces the generalization ability of models. Second, unlike regular printed document images, which have simple layouts and regular regions, lead movable-type newspapers have diverse layout elements and complex, compact layouts, and they suffer from age-related problems such as blurred characters, distorted text, faded ink, leakage, and physical damage. These make image layout analysis and text recognition very challenging. Third, Mongolian text contains a large number of words that are correctly shaped but incorrectly encoded, and existing isolated-word correction methods based on dictionaries and rules cannot handle out-of-vocabulary word errors or syntactic and semantic errors.

To address these problems, this dissertation studies the key technologies of newspaper image digitization: dataset construction, image layout analysis, text recognition, label noise processing, and text correction. The research content and main work cover the following four aspects.

(1) To address the scarcity of data resources, this dissertation constructs an image layout analysis dataset of Mongolian movable-type newspapers, image-text alignment corpora at three text granularities, and a text corpus for spelling correction. Together they provide training data for image layout segmentation, text recognition, post-recognition processing, and spelling correction. Aletheia, a semi-automatic page annotation tool, was used to annotate and classify page regions. The text alignment corpus was built from manually transcribed text and a pre-trained word recognition model. The text correction corpus, which includes a stem and affix corpus, a polyphonic word corpus, and a whole-word corpus, was built through manual input, web crawling, automatic correction, and other methods.

(2) Panoptic-DLA, a framework based on proposal-free panoptic segmentation, is proposed to address classic difficulties in layout segmentation, such as the false merging of adjacent regions and the missed detection of layout regions. Rather than treating layout analysis as a separate object detection or semantic segmentation problem, the framework defines it as a proposal-free panoptic segmentation task and assigns a semantic label and an instance label to each pixel of the document image through two decoupled branches, one for semantic segmentation and one for instance segmentation. The semantic segmentation branch uses DeepLabv3+ to predict the semantic category of foreground and background pixels at the pixel level, which improves the detection rate of foreground pixels. The instance segmentation branch models the text center-boundary probability and the text center direction, which adds region shape information to the model. Experimental results on the layout analysis dataset and two other public evaluation datasets show that the proposed method achieves the highest foreground pixel detection rate and effectively reduces merging errors.

(3) To reduce the negative effect of noisy labels in the training set on model generalization, MSL-MentorNet, a label noise detection method guided by a curriculum network, is proposed. The curriculum network uses each sample's sequence of loss values over multiple time steps as the criterion for measuring sample difficulty. Results on the newspaper image recognition dataset and on the MNIST handwritten digit recognition dataset show that the proposed method detects label noise better than “small-loss” heuristic sample selection or weighting methods, which consider only the current state of the model. In addition, the TRBA (TPS + ResNet + BiLSTM + Attention) framework is proposed as a strong baseline system for recognizing lead movable-type newspapers. In this network, a transformation layer is added before the encoder to normalize the image, and an implicit language model is added at the decoding stage to model contextual semantics. Based on the image-text alignment corpus, paragraph text recognition is implemented at two text granularities, word level and text-line level. Experimental results show that the transformation layer and the language model effectively improve text recognition performance.

(4) A multi-module encoding correction system that combines rules, dictionaries, and deep learning is proposed to correct words that have the correct glyphs but the wrong encoding, a problem caused by polyphonic glyph shapes in text that uses the international standard Mongolian encoding. The system overcomes the limitations of isolated-word methods and statistical models in correcting syntactic and semantic errors and sparse words, and it corrects monophonic words, out-of-vocabulary words, polyphonic words, and case suffixes in turn. The specific steps are as follows. First, monophonic words are corrected through dictionary matching, and candidate sets of same-shape words are generated for polyphonic words and case suffixes. Second, an Evolved Transformer is used to handle out-of-vocabulary words. Third, context2vec is used for word sense disambiguation of polyphonic words, which improves the correction accuracy of low-frequency words and effectively addresses the sparsity of the polyphonic word corpus. Finally, suffix correction is completed with a rule-based method. Experimental results show that spelling accuracy improves by 13.18% and 4.14%, respectively, after automatic correction of the training corpus text and of the word-level recognition output.

In summary, this dissertation investigates four key issues in the image digitization of Mongolian lead movable-type newspapers: dataset construction, layout analysis, character recognition, and text correction. The proposed methods were tested on several public datasets with good results, which suggests that they can also serve as a reference for document image digitization tasks in other languages.
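The four correction steps listed in point (4) can be pictured as a staged pipeline. The following is only a minimal sketch under loud assumptions: `shape_key`, `build_shape_index`, and `correct_tokens` are hypothetical names, the Latin letter pairs stand in for Mongolian letters that share a glyph, and the out-of-vocabulary corrector, the disambiguator, and the suffix rule are passed in as plain callables rather than the Evolved Transformer, context2vec, and rule set that the abstract names.

```python
from typing import Callable, Dict, List

# Placeholder mapping (an assumption): letters that look identical when
# rendered collapse to one symbol. These are NOT real Mongolian code points.
SAME_SHAPE = str.maketrans({"u": "o", "d": "t"})


def shape_key(word: str) -> str:
    """Return a key shared by all encodings that render with the same shape."""
    return word.translate(SAME_SHAPE)


def build_shape_index(lexicon: List[str]) -> Dict[str, List[str]]:
    """Group correctly encoded dictionary words by their rendered shape."""
    index: Dict[str, List[str]] = {}
    for word in lexicon:
        index.setdefault(shape_key(word), []).append(word)
    return index


def correct_tokens(
    tokens: List[str],
    index: Dict[str, List[str]],
    oov_corrector: Callable[[str], str],
    disambiguator: Callable[[List[str], List[str], List[str]], str],
    suffix_rule: Callable[[List[str]], List[str]],
) -> List[str]:
    """Run the stages in the order the abstract lists them."""
    corrected: List[str] = []
    for i, token in enumerate(tokens):
        candidates = index.get(shape_key(token), [])
        if len(candidates) == 1:
            # Monophonic word: dictionary matching gives a unique answer.
            fixed = candidates[0]
        elif not candidates:
            # Out-of-vocabulary word: defer to a sequence model (stubbed here).
            fixed = oov_corrector(token)
        else:
            # Polyphonic word: choose among same-shape candidates using the
            # already corrected left context and the raw right context.
            fixed = disambiguator(candidates, corrected, tokens[i + 1:])
        corrected.append(fixed)
    # Case suffixes are handled last by a rule-based pass.
    return suffix_rule(corrected)


if __name__ == "__main__":
    index = build_shape_index(["toto", "dor", "tor", "on"])
    print(correct_tokens(
        ["dudu", "tor", "xyz"],
        index,
        oov_corrector=lambda token: token,               # stub: leave unchanged
        disambiguator=lambda cands, left, right: cands[0],  # stub: first candidate
        suffix_rule=lambda words: words,                 # stub: no-op
    ))
```

Keeping the dictionary stage first means the learned components only see the words that a lookup cannot resolve, which is one plausible reading of why the abstract orders the steps this way.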
Keywords/Search Tags: Mongolian movable type newspaper, dataset, layout analysis, character recognition, label noise detection, spelling correction
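Point (3) of the abstract contrasts judging a sample by its loss at the current step with judging it by its loss values over multiple time steps. The toy sketch below illustrates only that contrast. It is not MSL-MentorNet: the abstract describes a curriculum network that consumes the loss sequence, whereas this sketch substitutes a plain mean over the sequence, and the function names and numbers are invented for illustration.

```python
from statistics import mean
from typing import List, Set


def small_loss_selection(losses: List[float], keep_ratio: float) -> Set[int]:
    """Keep the fraction of samples with the smallest loss values."""
    n_keep = int(len(losses) * keep_ratio)
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return set(ranked[:n_keep])


def loss_sequence_selection(history: List[List[float]], keep_ratio: float) -> Set[int]:
    """Keep samples whose loss, averaged over all recorded steps, is smallest.

    history[i] holds the losses of sample i at successive training steps.
    The mean is a stand-in (an assumption) for a learned scoring network.
    """
    scores = [mean(sequence) for sequence in history]
    return small_loss_selection(scores, keep_ratio)


# Invented loss histories for four samples over three training steps.
loss_history = [
    [2.0, 1.0, 0.30],  # clean sample, learned steadily
    [2.1, 0.9, 0.20],  # clean sample, learned steadily
    [2.5, 2.4, 0.25],  # suspicious: loss stayed high, then dropped late
    [2.2, 1.1, 0.90],  # hard sample: highest loss at the current step
]

current_losses = [sequence[-1] for sequence in loss_history]
kept_by_current = small_loss_selection(current_losses, keep_ratio=0.75)
kept_by_sequence = loss_sequence_selection(loss_history, keep_ratio=0.75)

if __name__ == "__main__":
    print("kept by current loss: ", sorted(kept_by_current))
    print("kept by loss sequence:", sorted(kept_by_sequence))
```

In this toy data the two criteria disagree about the third and fourth samples: the current-step loss alone keeps the sample whose loss dropped only late and discards the hard one, while the sequence-based score does the opposite.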