Research On Multi-Script Identification In Natural Images

Posted on: 2015-06-14
Degree: Master
Type: Thesis
Country: China
Candidate: M J Piao
GTID: 2298330431979186
Subject: Computer application technology
Abstract/Summary:
Multi-script identification in natural images is an important research problem for content-based image retrieval and for the development of multi-language OCR systems. With the growth of the information industry, the number of digital images has increased rapidly, so retrieving objects from large collections of stored images has significant practical value. However, retrieving images from large-scale databases accurately and quickly remains an open problem. Most existing OCR systems are trained on a single language and therefore fail on text in an unknown language or in multiple scripts. Moreover, text in natural images varies in the number of characters, font, size, coverage area, and spacing, so the existing multi-script identification methods for text images lack the flexibility to handle it. To address this problem, this dissertation proposes a multi-script identification method for natural images based on text edge density, text arrangement rules, and principal component analysis (PCA).

First, a text detection algorithm is presented that combines text edge density with text arrangement characteristics. The Sobel gradient operator is used to detect image edges, from which the edge density is computed. After the edge image is preprocessed with morphological operations, text regions are detected using prior assumptions about text arrangement.

Second, a PCA-based multi-script identification method is put forward. Character sample sets are first built for Korean, Chinese, and English, and the corresponding eigenspaces are constructed with PCA. The script is then identified by measuring the similarity between the original character and its reconstruction, using the Euclidean distance and the KL distance.

Finally, an algorithm for multi-script identification in natural images is designed by combining the text detection method and the multi-script identification method above.

In experiments, the proposed method achieved success rates of 88.36% for text detection and 87.37% for multi-script identification. These results indicate that the method is effective and feasible for identifying the script (Korean, Chinese, or English) of detected text regions in natural images.
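The two stages named in the abstract can be illustrated with a minimal NumPy sketch. It is not the dissertation's implementation: the function names, the block size, the edge threshold, and the choice of one eigenspace per script are all assumptions made for illustration. The morphological preprocessing and the text arrangement rules are omitted, and only the Euclidean distance is used for the reconstruction comparison (the KL distance is left out).

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude of a 2-D grayscale array using 3x3 Sobel kernels."""
    g = np.pad(gray.astype(float), 1, mode="edge")
    # Horizontal and vertical derivatives, written out from the Sobel kernels.
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[1:-1, :-2] - g[2:, :-2])
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[:-2, 1:-1] - g[:-2, 2:])
    return np.hypot(gx, gy)

def edge_density(gray, block=8, edge_thresh=100.0):
    """Fraction of edge pixels in each block x block cell (assumed parameters)."""
    edges = sobel_magnitude(gray) > edge_thresh
    h, w = edges.shape
    h, w = h - h % block, w - w % block  # crop to a whole number of cells
    cells = edges[:h, :w].reshape(h // block, block, w // block, block)
    return cells.mean(axis=(1, 3))

def fit_eigenspace(samples, k):
    """Build a k-dimensional PCA eigenspace from (n, d) flattened character images."""
    mean = samples.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruction_error(x, eigenspace):
    """Euclidean distance between a character and its PCA reconstruction."""
    mean, basis = eigenspace
    coeffs = basis @ (x - mean)       # project onto the eigenspace
    recon = mean + basis.T @ coeffs   # map back to pixel space
    return float(np.linalg.norm(x - recon))

def identify_script(x, eigenspaces):
    """Return the script whose eigenspace reconstructs x with the smallest error."""
    return min(eigenspaces, key=lambda name: reconstruction_error(x, eigenspaces[name]))
```

In this sketch, cells with a high edge density would be candidate text regions, and each character cropped from such a region would be flattened and passed to `identify_script` together with a dictionary of eigenspaces fitted on Korean, Chinese, and English samples.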
Keywords/Search Tags: text detection, multi-script identification, text arrangement, PCA, Euclidean distance, KL distance