
Research On Script Identification Based On Texture Feature Of Document Images

Posted on: 2010-10-31    Degree: Master    Type: Thesis
Country: China    Candidate: L J Gu    Full Text: PDF
GTID: 2178330332478440    Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of network communication and information processing technology, document images have become an important source of information. As communication among countries grows more frequent, documents in many languages and scripts need to be identified and processed, so script identification is essential for extracting information from document images effectively. This dissertation studies script identification based on the texture features of document images. The main work is as follows:
1. The features of document images, especially texture features, are studied in depth. The development history and current state of research on script identification are reviewed, and the results achieved so far and the difficulties that remain are pointed out.
2. A script identification algorithm based on the multi-wavelet transform is proposed. The energies of the sub-images obtained by multi-wavelet decomposition are used as features, and an SVM is used as the classifier. Experimental results confirm that the proposed algorithm outperforms the wavelet-based one and is especially robust to changes in character font and format.
3. Most existing texture feature extraction algorithms for script identification do not adapt to skewed text lines. To obtain rotation-robust features, texture units consisting of characters are decomposed by the Steerable Pyramid, and the energy features of the sub-bands are studied in depth. An algorithm robust to text-line skew is proposed by realigning the energy statistical features. Experiments are performed on an image database containing ten scripts with different skew angles. The results confirm that the algorithm identifies scripts accurately while remaining robust to text-line skew.
4. To exploit the orientation of characters and the rich texture of character edges, algorithms based on multi-scale geometric analysis are proposed. Document images are decomposed by the Contourlet and complex Contourlet transforms, and energy features of the sub-bands are extracted. In addition, the sub-bands of the Contourlet transform are modeled by the Generalized Gaussian Model, and the model parameters are used as features. An SVM is used as the classifier. Experiments on an image database containing fifteen scripts confirm that the proposed algorithms improve identification performance on scripts with similar visual features.
5. A stepwise script identification method is proposed. Fourteen scripts are identified in two steps: a text-line projection algorithm is used in the first step for coarse identification, and a texture-feature-based algorithm is used in the second step for fine identification. This method is efficient and accumulates little error. It is practical because the algorithm can be selected according to the characteristics of the script, and the steps according to the application requirements.
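The sub-band energy features that recur in items 2 to 4 can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a plain scalar Haar wavelet as a stand-in for the multi-wavelet, Steerable Pyramid, and Contourlet decompositions, uses mean squared coefficient value as the energy measure, and omits the SVM classifier. The function names (`haar_level`, `band_energy`, `energy_features`) are hypothetical.

```python
def haar_level(image):
    """One level of an orthonormal 2D Haar transform.

    `image` is a list of equal-length rows of numbers with even height
    and width. Returns the approximation band and three detail bands.
    """
    approx, col_diff, row_diff, diag = [], [], [], []
    for r in range(0, len(image), 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            # The 2x2 block: p q on the top row, s t on the bottom row.
            p, q = image[r][c], image[r][c + 1]
            s, t = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((p + q + s + t) / 2.0)  # local average
            c_row.append((p - q + s - t) / 2.0)  # change between columns
            r_row.append((p + q - s - t) / 2.0)  # change between rows
            d_row.append((p - q - s + t) / 2.0)  # diagonal change
        approx.append(a_row)
        col_diff.append(c_row)
        row_diff.append(r_row)
        diag.append(d_row)
    return approx, (col_diff, row_diff, diag)


def band_energy(band):
    """Mean squared coefficient value of one sub-band."""
    count = len(band) * len(band[0])
    return sum(v * v for row in band for v in row) / count


def energy_features(image, levels=2):
    """Detail-band energies for each level, then the final approximation energy."""
    features = []
    current = image
    for _ in range(levels):
        current, details = haar_level(current)
        features.extend(band_energy(band) for band in details)
    features.append(band_energy(current))
    return features


# Toy "texture": vertical stripes, so only the column-difference band responds.
stripes = [[0, 1, 0, 1] for _ in range(4)]
features = energy_features(stripes, levels=2)
```

In a pipeline like the one the abstract describes, a vector of this kind would be computed for each text block and passed to a classifier such as an SVM. The image side length must be divisible by 2 raised to `levels`.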
Keywords/Search Tags: document images, script identification, texture feature, multi-wavelet transform, Steerable Pyramid transform, Contourlet transform, Generalized Gaussian Model