
Research Of Character Segmentation Technology In Web Images

Posted on: 2009-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: X Peng
Full Text: PDF
GTID: 2178360278464143
Subject: Computer application technology
Abstract/Summary:
More and more images are being added to what were once purely textual web pages. These images contain a great deal of character information, which can be used not only by traditional text-based search engines to index and search web pages but also by multimedia search engines to search images. To make web pages more attractive, characters in web images may vary widely in color, language, and text style, may be laid out flexibly, and may be quite small. It is therefore necessary to study character segmentation for web images with these features in mind, building on existing character segmentation techniques.

Character segmentation is usually divided into two steps: character detection and character extraction. A character detection algorithm locates text regions in an image. For this step, an edge-feature-based detection method is designed and implemented in this thesis. The method is efficient and robust to variations in character size, color, and language.

Existing character extraction algorithms usually employ binarization. When a text region contains many components of different colors (gray scales), a more reasonable approach is to separate the region into components according to their color (gray scale) features. Histogram segmentation can be used to partition the gray-scale space of an image, so a character extraction method based on histogram segmentation is proposed. The algorithm exploits variations in the distribution of the difference histogram to locate segmentation points precisely. With prior knowledge, characters can then be extracted effectively.

When a detected text region contains non-text components whose color (gray scale) is similar to that of the characters, the histogram-segmentation-based method does not give good results. This problem can be solved effectively by also taking location information into account.
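
The histogram segmentation idea described above can be illustrated with a small sketch. It is not the thesis's algorithm, whose details the abstract does not give. It assumes that segmentation points are valleys of a smoothed gray-level histogram, found where the first difference of the histogram turns from negative to positive. The function names and the smoothing radius are hypothetical choices made for illustration.

```python
from bisect import bisect_left
from typing import List


def gray_histogram(pixels: List[int], levels: int = 256) -> List[int]:
    """Count how many pixels fall on each gray level."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    return hist


def smooth(hist: List[int], radius: int = 2) -> List[float]:
    """Moving-average smoothing so that small ripples do not create valleys."""
    n = len(hist)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(hist[lo:hi]) / (hi - lo))
    return out


def segmentation_points(hist: List[int], radius: int = 2) -> List[int]:
    """Return gray levels where the smoothed histogram stops falling and
    starts rising again (assumed here to mark a boundary between components)."""
    s = smooth(hist, radius)
    diff = [s[i + 1] - s[i] for i in range(len(s) - 1)]  # difference histogram
    points = []
    falling = False
    for i, d in enumerate(diff):
        if d < 0:
            falling = True
        elif d > 0 and falling:
            points.append(i)  # last level of the valley before the next peak
            falling = False
    return points


def split_components(pixels: List[int], points: List[int]) -> List[int]:
    """Label each pixel with the index of the gray-level interval it lies in."""
    return [bisect_left(points, p) for p in pixels]
```

Each resulting label corresponds to one gray-level interval. Deciding which interval holds the characters would still need the prior knowledge mentioned in the abstract.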
Therefore, a character extraction method based on DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is proposed. This method treats character extraction as clustering pixels that have similar color (gray scale) and lie in the same density region. All pixels in one cluster form a component of the image. After some decision rules are applied, the characters can be obtained.

The histogram-segmentation-based method is more efficient than the DBSCAN-based method. To improve the efficiency of the whole character segmentation process, simple rules are used to assess each detection result. Large text regions are passed to the DBSCAN-based method, because they are more likely to contain non-character components of similar color (gray scale), while small text regions are passed to the histogram-segmentation-based algorithm.

The performance of the character detection algorithm, the histogram-segmentation-based extraction algorithm, the DBSCAN-based extraction algorithm, and the combined extraction algorithm is analyzed in the experimental section.
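
As a rough illustration of clustering pixels by both gray level and position, the sketch below applies the standard DBSCAN procedure to a small grayscale image. It rests on assumptions not stated in the abstract: a pixel's neighborhood is taken to be the pixels within a square window of radius eps whose gray level differs by at most gray_tol, and the parameter names and default values are hypothetical.

```python
from collections import deque
from typing import List, Optional, Tuple

NOISE = -1  # label for pixels that belong to no dense region


def neighbors(image: List[List[int]], y: int, x: int,
              eps: int, gray_tol: int) -> List[Tuple[int, int]]:
    """Pixels within the eps window of (y, x) that have a similar gray level.
    The pixel itself is included, as in standard DBSCAN."""
    h, w = len(image), len(image[0])
    out = []
    for ny in range(max(0, y - eps), min(h, y + eps + 1)):
        for nx in range(max(0, x - eps), min(w, x + eps + 1)):
            if abs(image[ny][nx] - image[y][x]) <= gray_tol:
                out.append((ny, nx))
    return out


def dbscan_pixels(image: List[List[int]], eps: int = 1,
                  gray_tol: int = 10, min_pts: int = 4) -> List[List[int]]:
    """Return a label map: cluster index per pixel, or NOISE."""
    h, w = len(image), len(image[0])
    labels: List[List[Optional[int]]] = [[None] * w for _ in range(h)]
    cluster = 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] is not None:
                continue  # already assigned or marked as noise
            nb = neighbors(image, y, x, eps, gray_tol)
            if len(nb) < min_pts:
                labels[y][x] = NOISE  # may later become a border pixel
                continue
            labels[y][x] = cluster
            queue = deque(nb)
            while queue:
                qy, qx = queue.popleft()
                if labels[qy][qx] == NOISE:
                    labels[qy][qx] = cluster  # border pixel joins the cluster
                if labels[qy][qx] is not None:
                    continue
                labels[qy][qx] = cluster
                qnb = neighbors(image, qy, qx, eps, gray_tol)
                if len(qnb) >= min_pts:  # core pixel: keep expanding
                    queue.extend(qnb)
            cluster += 1
    return labels
```

Each cluster in the label map plays the role of one image component. Choosing which components are characters would still require the decision rules the abstract mentions, which are not reproduced here.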
Keywords/Search Tags: Character Segmentation, Character Detection, Character Extraction, Web Images, Histogram