Study On Script Identification Technology Based On Central Asian Multi-lingual Document Images

Posted on:2018-05-14

Degree:Master

Type:Thesis

Country:China

Candidate:A J G L M J T Bu

GTID:2348330533956497

Subject:Information and Communication Engineering

Abstract/Summary:

Many similarly shaped scripts are in use around the world today. In recent years, the processing of digital documents has become increasingly common in applications such as office and library automation, banking and postal services, publishing houses and communications management. The growing demand for tools that can search written and spoken sources of multilingual information has made the development of multilingual OCR systems an urgent problem. Before a multilingual OCR system can be applied, the script of each document image must be identified and the result supplied to the OCR system. Identifying scripts with similarly shaped characters is, however, a difficult task in pattern recognition. This thesis studies script identification for multi-lingual document text images based on the extraction of multiple features. The main contributions are as follows:

(1) To verify the effectiveness and stability of the algorithm, three multi-lingual document text image databases with different resolutions were first established, containing 1600, 2200 (100 dpi) and 2200 (200 dpi) whole-page document text images respectively and covering 11 scripts, including English, Chinese, Russian, Mongolian, Tibetan, Uygur, Turkish, Kyrgyz, Tajik and Kazakh.
(2) A script identification system for multi-lingual document text images is implemented using the HSV feature and a BP neural network classifier.
(3) The Tamura feature and texture features composed of six eigenvalues are extracted. These features are classified with six different classifiers, and the classification results are tallied.
(4) A weighted fusion method is proposed to extract fused features, and the optimal weights for identifying Central Asian multilingual scripts are determined.
(5) The Hu invariant moment feature is extracted and classified with different classifiers such as Bayesian, Euclidean distance, Mahalanobis distance and LDA.
(6) Finally, an identification method combining Hu invariant moments, Tamura features and texture features is proposed, and better recognition results are obtained with the fused features. The highest average identification rates on the three established datasets were 99.38%, 95.69% and 98.64%, respectively.

Experimental results indicate that the features extracted in this thesis describe multi-script document images well and can effectively classify the 11 scripts mentioned above. The method shows particular advantages and stability in identifying similar Central Asian scripts and Chinese minority scripts in text document images.
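The abstract does not say how the weighted fusion or the Euclidean distance classifier were implemented, so the following is only a minimal sketch under stated assumptions: each feature group (Hu moments, Tamura, texture) is scaled by a scalar weight and concatenated, and a sample is assigned to the class whose mean fused vector is nearest. The function names, the weights and the toy values are hypothetical and are not taken from the thesis.

```python
import math


def weighted_fusion(feature_groups, weights):
    """Scale each feature group by its weight and concatenate the results."""
    if len(feature_groups) != len(weights):
        raise ValueError("one weight per feature group is required")
    fused = []
    for group, weight in zip(feature_groups, weights):
        fused.extend(weight * value for value in group)
    return fused


def class_means(vectors, labels):
    """Return the mean fused vector of each class label."""
    sums, counts = {}, {}
    for vector, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(vector, means):
    """Assign the label whose class mean is nearest in Euclidean distance."""
    return min(means, key=lambda label: math.dist(vector, means[label]))


# Toy values invented for illustration; they are not data from the thesis.
WEIGHTS = [0.5, 0.3, 0.2]  # hypothetical weights for (Hu, Tamura, texture)
train = [
    ([[0.10, 0.20], [0.80, 0.70, 0.90], [0.30]], "script_a"),
    ([[0.12, 0.18], [0.75, 0.72, 0.88], [0.35]], "script_a"),
    ([[0.60, 0.70], [0.20, 0.25, 0.30], [0.90]], "script_b"),
    ([[0.58, 0.72], [0.22, 0.20, 0.33], [0.85]], "script_b"),
]
fused_train = [weighted_fusion(groups, WEIGHTS) for groups, _ in train]
means = class_means(fused_train, [label for _, label in train])

query = weighted_fusion([[0.11, 0.21], [0.78, 0.70, 0.90], [0.32]], WEIGHTS)
predicted = classify(query, means)
```

In a setup like this, the weights would typically be tuned on held-out data, which is one plausible reading of the "optimal weights" mentioned in contribution (4). A Mahalanobis distance variant would additionally need a covariance estimate for the fused features.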

Keywords/Search Tags:

Script Identification, Feature Extraction, Weighted Fusion, Multi-lingual Document Text Image, Similar Scripts

Related items

1	Study On Script Identification Of Central Asian Print Document Image Based On Texture Features
2	Study On Script Identification Of Multi-script Document Image Based On Texture Features
3	Research On Script Identification In Text Images Based On Deep Learning
4	Research On Script Identification Based On Texture Feature Of Document Images
5	Research On Text Detection And Multi-script Identification In Natural Images Based On Machine Learning
6	Research On Techniques Of Script Identification Of Document Images
7	Classification From Local And Global Perspective For Scene Text Script Identification
8	Research On Multi-Script Identification In Natural Images
9	Research On Image Classification Based On Weighted Multi-features Fusion And SVM
10	The Research Of Image Mosaic Technology Based On Weighted Poisson Fusion