Study On Script Identification Of Central Asian Print Document Image Based On Texture Features

Posted on:2019-07-10

Degree:Master

Type:Thesis

Country:China

Candidate:X K Han

GTID:2428330566966987

Subject:Information and Communication Engineering

Abstract/Summary:

An optical character recognition (OCR) system converts textual information in document images into electronic text. Before OCR can be applied, however, the script of the document must be determined. This step is usually done manually, but for large volumes of document images manual identification greatly reduces the automation and efficiency of the system. Extracting features that characterize the different scripts in document images, and identifying scripts automatically, is therefore an important research topic in document image analysis.

This thesis studies script identification for Central Asian multi-script document images. Because scripts differ in stroke features, spatial distribution, and structure, document images in different scripts exhibit different textures. Based on this property, a script identification method based on the non-subsampled contourlet transform (NSCT) is proposed. To address the difficulty of distinguishing similar scripts, a script identification method based on fused texture features is also proposed. The main work of this thesis is as follows:

1. The research history and current state of script identification are reviewed, and existing results in the field are summarized. The difficulties and open problems in script identification for Central Asian multi-script document images are analyzed.
2. Two standard multi-script document image databases were established, covering Arabic, Russian, Tibetan, Chinese, Uyghur, English, Mongolian, Kyrgyz, Kazakh, and Turkish.
3. An NSCT-based script identification method was proposed. A three-level NSCT was applied to the preprocessed document image, texture features were extracted from the resulting high-frequency and low-frequency subband images, and several classifiers were used for training and classification. Experimental results showed that this method achieves better recognition performance than traditional methods such as GLCM-based, LBP-based, and wavelet-transform-based script identification.
4. An NSCT-based fused texture feature for script identification was proposed. Gray-level co-occurrence matrix (GLCM) features and local binary pattern (LBP) features were extracted from each subband image generated by NSCT, and the resulting high-dimensional feature vector was reduced to a low-dimensional one using principal component analysis (PCA). Experiments with different classifiers showed that the proposed method outperforms traditional script identification methods based on a single texture feature.
5. A script identification method based on fused NSCT and Tamura texture features was proposed. Tamura features were extracted from each subband image produced by NSCT, and a support vector machine (SVM) classifier was used for training and classification. Experimental results showed that this method achieves better recognition performance than the NSCT-based script identification method.

Finally, the work of this thesis is summarized, and future directions for script identification in multi-script document images are discussed.
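
The feature-fusion step described in item 4 can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names are hypothetical, the NSCT decomposition and the PCA reduction are omitted (neither is in the Python standard library), and each "subband" is assumed to be already quantized to small non-negative integer gray levels. The sketch only shows how GLCM statistics and an LBP histogram computed per subband can be concatenated into one feature vector.

```python
from collections import Counter


def glcm_features(img, dx=1, dy=0):
    """Contrast, energy and homogeneity of a normalized GLCM.

    img is a list of rows of integer gray levels (assumed pre-quantized);
    (dx, dy) is the pixel-pair offset. The matrix is not symmetrized.
    """
    h, w = len(img), len(img[0])
    counts = Counter()
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                counts[(img[y][x], img[ny][nx])] += 1
    total = sum(counts.values())
    contrast = energy = homogeneity = 0.0
    for (i, j), c in counts.items():
        p = c / total
        contrast += (i - j) ** 2 * p
        energy += p * p
        homogeneity += p / (1 + abs(i - j))
    return [contrast, energy, homogeneity]


def lbp_histogram(img):
    """Normalized 256-bin histogram of basic 8-neighbour LBP codes.

    Only interior pixels are coded; a neighbour contributes a 1 bit
    when its value is greater than or equal to the centre value.
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = len(img), len(img[0])
    hist = [0] * 256
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            code = 0
            for bit, (oy, ox) in enumerate(offsets):
                if img[y + oy][x + ox] >= img[y][x]:
                    code |= 1 << bit
            hist[code] += 1
    n = (h - 2) * (w - 2)
    return [v / n for v in hist]


def fused_features(subbands):
    """Concatenate GLCM and LBP features over all subbands (no PCA here)."""
    feats = []
    for band in subbands:
        feats += glcm_features(band) + lbp_histogram(band)
    return feats
```

In a full pipeline the fused vector would typically be reduced in dimension (the abstract names PCA) and passed to a classifier; those stages, and the choice of GLCM offsets and statistics, are left open here because the abstract does not specify them.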

Keywords/Search Tags:

Multi-script document image, Script identification, Feature extraction, Fused texture feature, Non-subsampled Contourlet transform

Related items

1	Study On Script Identification Of Multi-script Document Image Based On Texture Features
2	Research On Script Identification Based On Texture Feature Of Document Images
3	Study On Script Identification Technology Based Central Asian Multi-lingual Document Images
4	Research On Techniques Of Script Identification Of Document Images
5	Research On Script Identification Of Printed Document Images
6	Texture Image Feature Extraction Based On Multi-Scale Transform Domain Hidden Markov Tree Model
7	A Novel Digital Watermarking Algorithm Based On Scale-invariant Feature Regions In Non-subsampled Contourlet Transform Domain
8	Research On Script Identification In Text Images Based On Deep Learning
9	Classification From Local And Global Perspective For Scene Text Script Identification
10	Research And Its Application Of Image Texture Classification Based On Contourlet Transform And Local Binary Pattern