Text information extraction from images has long been an active research topic. It is an important step in image classification and retrieval, book management, certificate processing, and similar applications. The pipeline includes image filtering, geometric correction, text location, color clustering, image segmentation, skew correction, character segmentation, binarization, and OCR. Of these, text location and OCR are the most important. Because OCR techniques are already mature, this work focuses on text location. We analyze the characteristics of text areas in print media and propose two novel text location methods.

First, we study traditional texture-based and region-based text location methods and propose a connected-component-based method. The input image is first corrected according to a skew estimate and then clustered by color. Connected components are extracted from the binary image corresponding to each color and classified as text or non-text using multiple features. Finally, the binary images are combined to produce the text areas of the input image. The proposed method accurately locates prominent text areas in images. However, because of the limitations of clustering and the immaturity of the features, it cannot locate text that resembles the background or text that is skewed. In addition, it shares a common drawback of traditional text location methods: too many thresholds are set manually, which undermines system reliability.

To address the shortcomings of traditional text location methods, we propose a machine-learning-based method. First, we collect a large number of text images, including book covers, CD covers, and movie posters photographed with cameras, and manually label and extract the text regions in them. Second, based on a statistical analysis of the differences between text and non-text samples, we derive three sets of features, which are used to build weak classifiers.
Third, AdaBoost is used to select these weak classifiers and combine them into a three-stage attentional cascade. Finally, the cascade locates text areas by classifying sub-regions of an image as text or non-text.

Experimental results show that, compared with the proposed connected-component-based method, the proposed machine-learning-based method has the following advantages: it requires no preprocessing (skew estimation and correction, color clustering), its thresholds are generated automatically, and it is highly extensible. Compared with other current methods, it is robust in locating single characters, skewed text lines, and even vertical text lines.
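
To make the idea of an attentional cascade concrete, the following is a minimal sketch of how a trained cascade might classify one sub-region. It is illustrative only and is not the implementation evaluated here: the weak classifiers are assumed to be decision stumps over a precomputed feature vector, and the names `Stump`, `Stage`, and `cascade_is_text`, as well as all weights and thresholds, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence

Features = Sequence[float]  # hypothetical: one feature vector per sub-region


@dataclass
class Stump:
    """Weak classifier (assumed form): thresholds a single feature."""
    feature_index: int
    threshold: float
    polarity: int   # +1: vote "text" when value >= threshold; -1: when value < threshold
    alpha: float    # weight of this weak classifier, as assigned by boosting

    def vote(self, x: Features) -> int:
        above = x[self.feature_index] >= self.threshold
        return 1 if above == (self.polarity == 1) else -1


@dataclass
class Stage:
    """One cascade stage: a weighted vote of weak classifiers."""
    stumps: List[Stump]
    stage_threshold: float

    def passes(self, x: Features) -> bool:
        score = sum(s.alpha * s.vote(x) for s in self.stumps)
        return score >= self.stage_threshold


def cascade_is_text(stages: List[Stage], x: Features) -> bool:
    """Label a sub-region as text only if every stage accepts it.

    all() stops at the first failing stage, so most non-text
    sub-regions are rejected early without evaluating later stages.
    """
    return all(stage.passes(x) for stage in stages)
```

In such a design, the early stages are typically kept small so that clear non-text sub-regions are discarded cheaply, while later stages spend more computation on the remaining candidates.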