
Research On Uyghur Recognition Technology Based On Word Part

Posted on: 2021-08-05
Degree: Master
Type: Thesis
Country: China
Candidate: N Li
GTID: 2518306047484944
Subject: Master of Engineering
Abstract/Summary:
At present, research on Uyghur text recognition falls into two directions: handwritten and printed. Printed Uyghur recognition aims to digitize paper documents, which supports the preservation and development of Uyghur culture, the development of information technology in Xinjiang, and national unity. Printed Uyghur recognition can in turn be approached at three levels: character-based, word-based, and word part-based. This thesis takes the word part as the recognition unit, applies classic OCR techniques and deep learning network models to printed Uyghur word parts, and develops a complete Uyghur recognition and translation system. The specific research content is as follows:

1. Establishing a printed Uyghur word part database. Starting from the laboratory's existing word database, the word parts contained in 4261 common words were counted and duplicate word parts were removed. This yielded 1792 common word parts, from which 50 sets of word part database samples in different fonts and font sizes were built.

2. Word part segmentation. Because the gaps between word parts are small and adjacent word parts may be slightly touching, a simple integral projection method produces both missed segmentations and over-segmentation, so the result is inaccurate. To divide word parts accurately, this thesis uses a segmentation method that merges connected domains according to their positional relationships. The method traverses the connected domains and merges them according to the common positional relationships between the main body and the additional bodies of a word part, finally yielding independent word parts. Experiments show that connected domain merging accurately segments the word parts of printed Uyghur documents of different sizes and font sizes, with segmentation accuracy close to 100%.

3. Word part recognition based on classic OCR. The recognition pipeline consists of five major modules: document image preprocessing, document image segmentation, word part image preprocessing, feature extraction, and classification. Experiments show that the top-1 recognition accuracy of printed Uyghur word part recognition based on classic OCR is at least 92.97%, and the recognition speed is reported as 95.36 ms per word part.

4. Word part recognition based on deep learning. A 7-layer convolutional neural network is designed with reference to the LeNet-5 and AlexNet models, and its generalization ability is improved through data augmentation. Experiments show that printed Uyghur word part recognition based on this convolutional neural network achieves satisfactory accuracy, with top-1 recognition accuracy stable at 99%.

5. Recognition and translation system. Building on the research above, a printed Uyghur recognition and translation system is designed. It recognizes or translates printed Uyghur documents acquired from a connected scanner or opened from image files. The recognition operation outputs editable Uyghur word parts, and the translation operation outputs the Chinese translation of the corresponding Uyghur words.
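The connected domain merging idea in the segmentation step can be illustrated with a minimal Python sketch. The abstract does not state the thesis's actual merging rules, so everything here is an assumption made for illustration: the `Box` type, the `merge_word_parts` function, the `main_area_ratio` threshold that separates main bodies from additional bodies (dots and other small marks), and the rule that attaches each additional body to the main body it overlaps most in the horizontal direction.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Bounding box of one connected domain, in pixels (x1 and y1 exclusive)."""
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def area(self) -> int:
        return (self.x1 - self.x0) * (self.y1 - self.y0)

    def union(self, other: "Box") -> "Box":
        # Smallest box that contains both boxes.
        return Box(min(self.x0, other.x0), min(self.y0, other.y0),
                   max(self.x1, other.x1), max(self.y1, other.y1))


def horizontal_overlap(a: Box, b: Box) -> int:
    """Width in pixels shared by the x-ranges of a and b (0 if disjoint)."""
    return max(0, min(a.x1, b.x1) - max(a.x0, b.x0))


def merge_word_parts(boxes: List[Box], main_area_ratio: float = 0.3) -> List[Box]:
    """Merge additional bodies into main bodies (hypothetical rule).

    Assumption: a connected domain is a main body if its box area is at
    least `main_area_ratio` times the largest box area on the line;
    anything smaller is treated as an additional body.
    """
    if not boxes:
        return []
    threshold = main_area_ratio * max(b.area for b in boxes)
    mains = [b for b in boxes if b.area >= threshold]
    extras = [b for b in boxes if b.area < threshold]

    merged = list(mains)
    for extra in extras:
        def score(i: int):
            # Prefer the largest horizontal overlap with the original main
            # body; break ties by the smallest distance between centers.
            m = mains[i]
            center_gap = abs((extra.x0 + extra.x1) - (m.x0 + m.x1))
            return (-horizontal_overlap(extra, m), center_gap)

        best = min(range(len(mains)), key=score)
        merged[best] = merged[best].union(extra)

    # Uyghur is written right to left, so return the rightmost part first.
    return sorted(merged, key=lambda b: b.x1, reverse=True)


if __name__ == "__main__":
    line = [
        Box(60, 10, 100, 40),  # main body of the right word part
        Box(75, 0, 81, 6),     # dot above the right word part
        Box(10, 12, 50, 40),   # main body of the left word part
        Box(20, 44, 26, 50),   # dot below the left word part
    ]
    for part in merge_word_parts(line):
        print(part)
```

In practice the input boxes would come from a connected component labeling pass over the binarized text line. A fixed area ratio is only a stand-in for whatever positional rules the thesis applies, and it would need tuning for each font and font size.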
Keywords/Search Tags:Uyghur word part, Word part segmentation, Optical character recognition, Convolutional neural networks model