Research On Holistic Recognition Technology For Words Of Historical Mongolian Documents Based On Deep Learning

Posted on: 2019-08-22
Degree: Master
Type: Thesis
Country: China
Candidate: X Liu
GTID: 2405330563457203
Subject: Computer technology
Abstract/Summary:
Historical Mongolian documents are important resources for the study of Mongolian history and culture. To strengthen their protection and make them more useful for research, more and more of these documents have been digitized as images. However, the images cannot be edited directly, nor can they be effectively analyzed, counted, or searched. Recognizing the images of historical Mongolian documents and converting them into editable electronic text is therefore meaningful work.

Recognition is difficult for two reasons. First, the documents were printed from woodblocks carved by different craftsmen, so the same word varies in shape from one occurrence to another. Second, the documents are old and show varying degrees of staining, loss, and fading, which lowers the quality of the word images.

There is currently little research on the recognition of historical Mongolian documents, and most of it follows the segmentation-based approach, in which each word to be recognized is segmented into its glyphs and the glyphs serve as the recognition unit. Another approach, the multi-knowledge strategy method, is also built on segmentation. These approaches often have the following problems. First, they can only be applied to high-quality, low-noise word images. Second, the accuracy of character segmentation is a key factor in determining the recognition result.

This paper studies holistic recognition of historical Mongolian document word images based on deep learning. The aim is to address the difficulty of segmenting word images effectively and the sensitivity of recognition to noise, and thereby to further improve recognition accuracy. The historical Mongolian document known as the Kanjur is used as the material for the study. The main contents of this paper are as follows:

(1) The experimental material comes from the digitized Kanjur held in the Library of Inner Mongolia University. We randomly selected 100 pages from the Kanjur. After layout analysis and binarization, we cut the pages into 20176 word images as experimental samples, divided them into 1336 categories, and labeled each sample manually. Because some classes contain too few samples, SMOTE (Synthetic Minority Over-sampling Technique) is used to expand the samples. After expansion, the data set contains 267200 samples in total, and this is the data set used in the experiments.

(2) The word images of historical Mongolian documents are of poor quality and cannot be segmented correctly. Given the strong performance of deep learning in image recognition, the classical convolutional neural network LeNet-5 is used as the basic model of the experiments. This paper proposes two holistic recognition approaches for words of historical Mongolian documents: one based on an improved convolutional neural network (CNN) and one based on a recurrent neural network (RNN). The factors that affect recognition ability are examined by varying the image size, the number of training iterations, the balance of the data set distribution, and the input granularity of the recurrent neural network. The performance of the convolutional neural network and the recurrent neural network is also compared using the experimental results.

(3) To address the problem of out-of-vocabulary word recognition, this paper draws on the strengths of Long Short-Term Memory (LSTM) and proposes a holistic recognition technology for words of historical Mongolian documents based on CNN-LSTM. It reaches a recognition accuracy of 84.5%. Although this accuracy is lower than that of the CNN or RNN models, the approach successfully handles out-of-vocabulary words.
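The SMOTE expansion described in (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `smote_oversample`, the neighbour count `k=5`, and the use of plain feature vectors are assumptions made for illustration, and the abstract does not say how the word images were turned into features. The target of 200 samples per class is only an inference from the reported totals (1336 × 200 = 267200). The sketch grows a single class by interpolating between a sample and one of its nearest same-class neighbours, which is the core idea of SMOTE.

```python
import math
import random


def smote_oversample(samples, target_count, k=5, seed=0):
    """Grow ONE class to target_count samples by SMOTE-style interpolation.

    samples: list of equal-length feature vectors (lists of floats).
    Returns the original samples followed by the synthetic ones.
    Hypothetical helper: name, k, and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    if len(samples) < 2 or len(samples) >= target_count:
        # Nothing to interpolate between, or the class is already large enough.
        return [list(s) for s in samples]

    out = [list(s) for s in samples]
    k = min(k, len(samples) - 1)
    while len(out) < target_count:
        i = rng.randrange(len(samples))
        base = samples[i]
        # Indices of the other samples, sorted by distance to the base sample.
        others = [j for j in range(len(samples)) if j != i]
        others.sort(key=lambda j: math.dist(base, samples[j]))
        neighbour = samples[rng.choice(others[:k])]
        # New point at a random position on the segment base -> neighbour.
        gap = rng.random()
        out.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return out


# Toy class with three 2-D samples, expanded to an assumed 200 per class.
rare_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
balanced = smote_oversample(rare_class, target_count=200)
print(len(balanced))  # 200
```

Because every synthetic point lies on a segment between two real samples of the same class, the expanded class stays inside the region spanned by its original samples. A class with a single sample has no neighbour to interpolate with, so this sketch returns it unchanged; how the thesis treated such classes is not stated.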
Keywords/Search Tags:historical Mongolian documents, holistic word recognition, deep learning, CNN, LSTM, SMOTE