
Study On Script Identification Of Multi-script Document Image Based On Texture Features

Posted on: 2020-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: S Li
GTID: 2428330590454688
Subject: Information and Communication Engineering
Abstract:
With the advent of the information age, more and more data is stored digitally in the form of text images, and globalization has made exchanges between countries more frequent. Optical Character Recognition (OCR) is widely used to process this mass of information. Script identification is a front-end processing step for OCR and an important part of text image analysis, and it has become a research hotspot. Research on script identification has produced many important results since the early 1990s. However, most research databases contain scripts from only a few regions and hold a small amount of data, so it is uncertain whether the resulting methods apply to more scripts.

To address these problems in script identification, this paper establishes a multi-script document image database. The selected scripts include globally used scripts, Central Asian languages and domestic minority scripts, so that the database is broadly applicable. This paper then studies multi-script document image identification. Because scripts differ in structural features, stroke features and spatial distribution, document images of different scripts show different texture features. A script identification method based on the curvelet transform and a script identification method based on HOG features are proposed. To improve on the identification rate of single texture features, a script identification method based on fused texture features is also proposed.

The main work of this paper is as follows:

1. The research history and current state of the field are introduced, and the outstanding results obtained in the field are summarized. The difficulties of multi-script document image identification are analyzed.

2. A standard multilingual document image database is created. The image resolution is 200 dpi and the image size is 256*256 pixels. The database contains 9 scripts: Chinese, Russian, English, Turkish, Kyrgyz, Kazakh, Tibetan, Uighur and Mongolian. There are 1000 images for each script.

3. In the database we built, some of the scanned book pages are soft and show through content from the reverse side. Weighted-average grayscale conversion, median-filter denoising and global-threshold binarization are used to preprocess the documents, so that the binarized images have a uniform background and reduced noise before feature extraction.

4. A script identification method based on the curvelet transform is proposed. The energy features of the coefficients obtained by the curvelet transform are extracted to form a feature vector. Bayes, linear discriminant analysis and SVM classifiers are used for training and classification. The experimental results show that the method outperforms traditional script identification methods such as the wavelet transform, the dual-tree complex wavelet transform and LBP.

5. A script identification method based on HOG features is proposed. Histograms of gradient directions are computed over local regions of the document image and combined into a feature vector. The feature vectors are trained and classified using different classifiers. Compared with classical methods, the experimental results show that feature extraction time is short and that the texture features of the document image are extracted accurately, which effectively improves the script identification rate.

6. A script identification method based on fused texture features of curvelet transform sub-bands is proposed. The texture features of the high-frequency and low-frequency sub-bands of the curvelet transform are extracted, and the statistical features of the image are fused to form a feature vector. The feature vectors are trained and classified using different classifiers. The experimental results show that the texture features of document images are extracted accurately and that this method improves the efficiency of script identification.

Finally, the work of this paper is summarized, and future directions for script identification of document images are discussed.
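
The preprocessing chain described in item 3 (weighted-average grayscale, median-filter denoising, global-threshold binarization) can be illustrated with a minimal sketch. This is not the thesis implementation. The grayscale weights (the common 0.299/0.587/0.114 luma coefficients), the 3x3 window, the fixed threshold of 128 and the output polarity (1 for light background, 0 for dark ink) are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from statistics import median


def to_gray(rgb):
    """Weighted-average grayscale of an image given as rows of (r, g, b) tuples.

    The weights are the widely used luma coefficients; the abstract does not
    state which weights were used, so this is an assumption.
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb]


def median_filter3(img):
    """3x3 median filter with edge replication (window size is an assumption)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            # Collect the 3x3 neighbourhood, clamping indices at the borders.
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = median(window)
    return out


def binarize(img, threshold=128):
    """Global-threshold binarization: 1 for pixels at or above the threshold."""
    return [[1 if v >= threshold else 0 for v in row] for row in img]


def preprocess(rgb, threshold=128):
    """Grayscale, denoise, then binarize, in the order given in the abstract."""
    return binarize(median_filter3(to_gray(rgb)), threshold)
```

For example, a white 5x5 patch containing a single black noise pixel comes out entirely as background, because the median filter removes the isolated pixel before thresholding. In practice the fixed threshold would likely be replaced by one derived from the image histogram.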
Keywords: Script identification, Multi-script document image, Curvelet transform, HOG feature, Fused texture feature