
Study On Script Identification Technology Based Central Asian Multi-lingual Document Images

Posted on: 2018-05-14  Degree: Master  Type: Thesis
Country: China  Candidate: A J G L M J T Bu  Full Text: PDF
GTID: 2348330533956497  Subject: Information and Communication Engineering
Abstract/Summary:
Many similarly shaped scripts are in use around the world today. In recent years, the processing of digital documents has become increasingly common in applications such as office and library automation, banking and postal services, publishing houses and communications management. The growing demand for tools that can search written and spoken sources of multilingual information has made the development of multilingual OCR systems an urgent problem. Before a multilingual OCR system can be implemented, the scripts in multilingual document images must be identified so that the result can be supplied to the OCR system. Identifying scripts with similarly shaped characters is, moreover, a difficult task in pattern recognition. This thesis focuses on script identification for multilingual document text images based on the extraction of multiple features. Its main contributions are as follows:
(1) To verify the effectiveness and stability of the algorithm, three multilingual document text image databases with different resolutions are first established, containing 1600, 2200 (100 dpi resolution) and 2200 (200 dpi resolution) whole-page document text images respectively, and covering 11 scripts, including English, Chinese, Russian, Mongolian, Tibetan, Uygur, Turkish, Kyrgyz, Tajik and Kazakh.
(2) A script identification system for multilingual document text images is implemented using HSV features and a BP neural network classifier.
(3) Tamura features and texture features composed of six eigenvalues are extracted. These features are classified with six different classifiers, and the classification results are tallied.
(4) A weighted fusion method is proposed for extracting fused features, and the optimal weights for identifying Central Asian multilingual scripts are determined.
(5) Hu invariant moment features are extracted and classified with several classifiers, such as Bayesian, Euclidean distance, Mahalanobis distance and LDA.
(6) Finally, an identification method that combines Hu invariant moments, Tamura features and texture features is proposed, and the fused features yield better recognition results. The highest average identification rates obtained on the three established datasets are 99.38%, 95.69% and 98.64% respectively.
Experimental results indicate that the features extracted in this thesis describe multi-script document images well and can effectively classify the 11 scripts mentioned above. The method shows particular advantages and stability in identifying similar Central Asian scripts and Chinese minority scripts in text document images.
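The abstract does not give the fusion formula or the classifier details, so the following is only a minimal sketch of one common way weighted feature fusion and a minimum Euclidean distance classifier can be combined. It assumes that each feature group (for example Hu moments, Tamura features and texture features) is normalized to unit length, scaled by its weight and concatenated. The function names, the normalization step and the weights are illustrative assumptions, not the thesis's actual method.

```python
import math


def normalize(vec):
    """Scale a feature vector to unit Euclidean length (zero vectors are copied unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def weighted_fusion(feature_groups, weights):
    """Concatenate normalized feature groups, each scaled by its weight.

    Assumption: fusion is a weighted concatenation; the thesis may define it differently.
    """
    if len(feature_groups) != len(weights):
        raise ValueError("one weight per feature group is required")
    fused = []
    for vec, w in zip(feature_groups, weights):
        fused.extend(w * x for x in normalize(vec))
    return fused


def class_means(samples):
    """Map each script label to the mean of its fused training vectors."""
    means = {}
    for label, vecs in samples.items():
        n = len(vecs)
        means[label] = [sum(col) / n for col in zip(*vecs)]
    return means


def classify(vec, means):
    """Return the label whose class mean is nearest in Euclidean distance."""
    return min(means, key=lambda label: math.dist(vec, means[label]))
```

In such a setup, the weights would typically be chosen by evaluating the identification rate on held-out images for a range of candidate weight combinations and keeping the best one.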
Keywords/Search Tags:Script Identification, Feature Extraction, Weighted Fusion, Multi-lingual Document Text Image, Similar Scripts