
Research On Deep Learning Based Script Identification Method Of Korean History Documents

Posted on: 2020-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: X C Liu
GTID: 2428330572989359
Subject: Computer application technology
Abstract/Summary:
Korean historical documents written after the fifteenth century contain not only Korean characters but also Chinese, Manchu, Mongolian and other scripts, mostly in multilingual mixed typesetting. This makes character segmentation and recognition difficult and slows the digitization of these documents. If a recognition algorithm is applied directly to unclassified multilingual text images, its complexity increases and its accuracy drops. Many studies have shown that it is difficult to find a universal layout analysis and processing algorithm for text images. Character segmentation of multilingual mixed historical documents therefore remains an unsolved problem in text segmentation, with both research significance and practical value.

To promote the digitization of ancient Korean books, this dissertation proposes a character segmentation method used as part of the script identification process. The method is suited to Korean historical documents with multilingual mixed writing, varying character sizes, large variation in character spacing and complex touching or overlapping characters. Firstly, a text column segmentation algorithm based on connected component rules and the projection method is proposed. It removes the separator lines between columns and also splits touching text columns, handling broken or skewed separator lines and characters that overlap between columns. Secondly, a multi-step character segmentation algorithm based on connected component rules is proposed, consisting of rough segmentation followed by fine segmentation. It works well on multilingual documents with mixed horizontal and vertical arrangement and characters of different sizes. For images in which the number and direction of overlapping characters are unknown, an improved recursive drop-fall algorithm using K-means is proposed; it correctly segments images containing an unknown number of touching characters. Then a character image database of similar scripts in ancient books is built from the segmented character images. Finally, the dissertation studies script identification of Korean and Chinese character images in this database, using the Inception-v4 convolutional neural network. This addresses the high error rate of traditional machine learning when identifying similar scripts such as Korean and Chinese, and the identification results provide an accurate and reliable dataset for future research on Korean and Chinese character recognition algorithms.

The experimental results show an accuracy of 97.69% for the column segmentation algorithm, 87.79% for the character segmentation algorithm and 99.40% for character-level script identification. These results indicate that the proposed column and character segmentation algorithms can effectively segment images of historical documents with multiple languages and complex typesetting. Meanwhile, the script identification method based on a convolutional neural network (CNN) works well on images of Korean and Chinese ancient books that contain heavy noise.
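The projection step mentioned for column segmentation can be illustrated with a minimal sketch. It is not the dissertation's algorithm, which also applies connected component rules to cope with skewed separator lines and touching columns; it shows only a plain vertical projection profile on an already binarized page. The function name `split_columns` and the parameter `min_gap` are hypothetical names chosen for illustration.

```python
import numpy as np

def split_columns(binary, min_gap=2):
    """Return (start, end) x-ranges of text columns, end exclusive.

    binary:  2-D array, nonzero = ink (assumed already binarized).
    min_gap: hypothetical parameter; an empty run at least this wide
             is treated as a gap between columns.
    """
    # Vertical projection: number of ink pixels at each x position.
    profile = (np.asarray(binary) != 0).sum(axis=0)
    columns, start, gap = [], None, 0
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x          # a new column begins here
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # empty run is wide enough: close the column
                columns.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:          # column still open at the page edge
        columns.append((start, len(profile) - gap))
    return columns

# Toy page: two ink bands separated by three empty pixel columns.
page = np.zeros((5, 10), dtype=np.uint8)
page[:, 1:3] = 1
page[:, 6:9] = 1
print(split_columns(page))
```

A profile like this separates columns only where a clean empty gap exists, which is presumably why the dissertation combines projection with connected component rules for pages with separator lines and overlapping characters.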
Keywords/Search Tags:Ancient books digitalization, Korean historical documents, multi-script identification, character segmentation