Study On Script Identification Of Central Asian Print Document Image Based On Texture Features

Posted on: 2019-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: X K Han
Full Text: PDF
GTID: 2428330566966987
Subject: Information and Communication Engineering
Abstract/Summary:
An optical character recognition (OCR) system converts the textual information in document images into electronic documents. Before OCR can be applied, however, the script of the document to be processed must be determined. This step is usually done manually, but for large volumes of document images, manual identification greatly reduces the automation and efficiency of the system. Extracting the characteristics of the different scripts in document images and identifying scripts automatically is therefore an important research topic in document image analysis.

This thesis studies script identification for Central Asian multi-script document images. Because scripts differ in stroke features, spatial distribution, and structure, document images in different scripts exhibit different textures. Based on this property, a script identification method based on the non-subsampled contourlet transform (NSCT) is proposed. To address the difficulty of classifying and identifying similar scripts, a script identification method based on fused texture features is also proposed. The main work of this thesis is as follows:

1. The research history and current state of script identification are reviewed, and existing results in the field are summarized. The difficulties and open problems in script identification for Central Asian multi-script document images are analyzed.

2. Two standard multi-script document image databases were established, covering Arabic, Russian, Tibetan, Chinese, Uyghur, English, Mongolian, Kyrgyz, Kazakh, and Turkish.

3. An NSCT-based script identification method is proposed. A three-level NSCT is applied to the preprocessed document image, texture features are extracted from the resulting high-frequency and low-frequency subband images, and training and classification are carried out with several different classifiers. The experimental results show that this method achieves better recognition performance than traditional script identification methods based on the gray-level co-occurrence matrix (GLCM), local binary patterns (LBP), and the wavelet transform.

4. An NSCT-based fused texture feature for script identification is proposed. GLCM features and LBP features are extracted from each subband image generated by the NSCT, and the resulting high-dimensional feature vector is reduced to a low-dimensional one using principal component analysis (PCA). Experiments were performed with different classifiers. The results show that the proposed method performs better than traditional script identification methods based on a single texture feature.

5. A script identification method based on fused NSCT and Tamura texture features is proposed. Tamura features are extracted separately from each subband image produced by the NSCT, and a support vector machine (SVM) classifier is used for training and classification. The experimental results show that this method achieves better recognition performance than the NSCT-based script identification method.

Finally, the work of this thesis is summarized, and future directions for script identification in multi-script document images are discussed.
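
As a rough illustration of the fused-feature idea in item 4, the following Python sketch computes a few GLCM statistics and an LBP histogram for one grayscale image and concatenates them into a single feature vector. It is not the thesis's implementation. The NSCT decomposition, the PCA reduction, and the classifier are omitted because they need external libraries, so the function would have to be applied to each subband image separately. The function names (`glcm_features`, `lbp_histogram`, `fused_texture_features`), the choice of four GLCM statistics, the single horizontal pixel offset, and the basic 8-neighbour LBP operator are all assumptions made for this example.

```python
import math
from collections import Counter


def glcm_features(img, dx=1, dy=0):
    """Contrast, energy, homogeneity and entropy of a normalised GLCM.

    img is a list of rows of integer gray levels; (dx, dy) is the pixel
    offset used to form co-occurring pairs (assumed: horizontal neighbour).
    """
    rows, cols = len(img), len(img[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[(img[r][c], img[r2][c2])] += 1
    total = sum(counts.values())
    contrast = energy = homogeneity = entropy = 0.0
    for (i, j), n in counts.items():
        p = n / total  # joint probability of the gray-level pair (i, j)
        contrast += (i - j) ** 2 * p
        energy += p * p
        homogeneity += p / (1 + abs(i - j))
        entropy -= p * math.log2(p)
    return [contrast, energy, homogeneity, entropy]


def lbp_histogram(img):
    """Normalised 256-bin histogram of basic 8-neighbour LBP codes.

    Border pixels are skipped because they lack a full neighbourhood.
    """
    rows, cols = len(img), len(img[0])
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                if img[r + dr][c + dc] >= img[r][c]:
                    code |= 1 << bit
            hist[code] += 1
    n = (rows - 2) * (cols - 2)
    return [h / n for h in hist]


def fused_texture_features(img):
    """Concatenate GLCM statistics and the LBP histogram (4 + 256 values)."""
    return glcm_features(img) + lbp_histogram(img)


# Toy input: a 4x4 checkerboard standing in for one subband image.
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
features = fused_texture_features(checkerboard)
print(len(features))
```

In a full pipeline along the lines described above, the vectors from all subband images would be concatenated, reduced with PCA, and passed to a classifier such as an SVM.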
Keywords/Search Tags: Multi-script document image, Script identification, Feature extraction, Fused texture feature, Non-subsampled contourlet transform