Feature-based Document Image Retrieval

Posted on: 2010-12-22
Degree: Master
Type: Thesis
Country: China
Candidate: T Zhang
Full Text: PDF
GTID: 2208360275963025
Subject: Management Science and Engineering
Abstract/Summary:
As an important part of image retrieval, document image retrieval has wide applications in digital libraries, office automation (OA), and similar areas. Document image retrieval aims to find a ranked sequence of document images that are highly similar to an input image or feature. Existing document image retrieval algorithms fall into two classes: character-content (OCR) based methods and image-level feature-based methods. Document matching is the foundation of the image-level feature-based methods: it finds the best-matching image in an image database. Matching in turn rests on the definition and extraction of features, since every match must correspond to at least one feature. Based on an analysis of the strengths and drawbacks of existing feature-based document image retrieval methods, a novel retrieval method is proposed. It follows the basic procedure of feature-based document image retrieval and borrows the sub-block method from content-based image retrieval. First, the document image is preprocessed, which includes noise removal and skew detection. Noise is removed efficiently with a filter template, and the SIFT feature is extracted after filtering. As the first step of skew detection, binarization uses both local and global statistical data. The borderlines of the document are then extracted, and skew is detected from these borderlines. Introducing the LMS algorithm makes the skew detection more intelligent. The extracted borderlines can also be used as features in retrieval. After preprocessing, the effective area of the image is located, and the length and width of the effective area and the density feature are defined and extracted. The next step is to segment the effective area into text areas and non-text areas.
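The borderline-based skew detection described above can be illustrated with a small sketch. The abstract does not expand "LMS" or describe the fitting procedure, so this example assumes an ordinary least-squares line fit through points sampled along one extracted borderline and reads the skew angle from the slope. The function name `estimate_skew_degrees` and the sample points are hypothetical.

```python
import math


def estimate_skew_degrees(points):
    """Fit y = a*x + b by ordinary least squares and return atan(a) in degrees.

    `points` is a list of (x, y) pixel coordinates sampled along a roughly
    horizontal borderline. This is an assumed stand-in for the thesis's
    LMS-based step, not its actual algorithm.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two borderline points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("points are vertically aligned; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return math.degrees(math.atan(slope))


# Hypothetical points along a top borderline that rises 1 pixel per 20 pixels.
border = [(x, 100 + x / 20) for x in range(0, 200, 10)]
angle = estimate_skew_degrees(border)
print(f"estimated skew: {angle:.2f} degrees")
```

Rotating the image by the negative of the estimated angle would then deskew it. A robust fit could replace the plain least-squares step if the borderline points contain outliers.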
The segmentation method is based on the ISI learning algorithm, which learns the segmentation templates. From the text areas, local features are extracted: the distances between components (the gap lengths), the heights of components, and the widths of components. Global features are also extracted: the number of connected components, the number of cavities, the average component height, the average component width, the average distance between components, and the paragraph feature. From the non-text areas, the key block feature is extracted. The SIFT feature is invariant to scaling, shift, and distortion, which makes it robust to deformation of the document image. The features extracted from the text areas are low-level features that characterize the document in depth. The density feature and the key block feature are both shown to be effective in describing the document image. In short, the features defined and extracted in this thesis are effective: they include both global and local features and cover both low-level and high-level features, so their combination gives a sufficient representation of the image. The features are grouped into three feature vectors according to their dimensions and properties. The A-tree, an index structure suited to very high-dimensional data, is used to organize the features extracted from the document image database. A separate A-tree is built for each feature vector. Querying each A-tree yields three result sets, and the candidate set is formed as their union.
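As an illustration of the local and global text-area features listed above, the following sketch computes several of them from the bounding boxes of connected components on a single text line. The box format `(x, y, width, height)`, the left-to-right ordering, and the function name `text_line_features` are assumptions made for this example. The cavity count and the paragraph feature are omitted because the abstract does not define them.

```python
def text_line_features(boxes):
    """Compute simple local and global features from component bounding boxes.

    `boxes` is a list of (x, y, width, height) tuples, one per connected
    component on a text line (an assumed input format).
    """
    if not boxes:
        raise ValueError("need at least one component")
    ordered = sorted(boxes, key=lambda b: b[0])  # left-to-right order
    widths = [w for _, _, w, _ in ordered]
    heights = [h for _, _, _, h in ordered]
    # Gap: distance from the right edge of one component to the left edge
    # of the next; overlapping components are clamped to a gap of 0.
    gaps = [
        max(0, nxt[0] - (cur[0] + cur[2]))
        for cur, nxt in zip(ordered, ordered[1:])
    ]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "local": {"gaps": gaps, "heights": heights, "widths": widths},
        "global": {
            "component_count": len(ordered),
            "avg_height": mean(heights),
            "avg_width": mean(widths),
            "avg_gap": mean(gaps),
        },
    }


# Hypothetical bounding boxes for three components on one line.
line = [(10, 5, 8, 12), (22, 5, 6, 12), (40, 7, 10, 10)]
features = text_line_features(line)
print(features["global"])
```

The local lists vary in length from line to line while the global values have a fixed size, which is one plausible reason for grouping features into separate vectors by dimension.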
After the weights of the candidates are updated, the final answers are selected according to those weights. The proposed retrieval method applies to both handwritten and printed document images. To verify experimentally how well the features adapt, experiments are performed separately on database Ⅰ, which consists of 3900 text-dominated images, and on database Ⅱ, which consists of 2124 images mixing text, pictures, and tables. The retrieval method is then tested on the whole database. A comparative experiment contrasts the proposed method with an existing one. The results show that the proposed method performs well, is robust, and is practical.
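The candidate-set step in the abstract (one query per feature vector, a union of the three result sets, then ranking by weight) can be sketched as follows. The abstract does not say how the weights are updated, so the scoring rule here, which adds `1 / (1 + distance)` for every index that returns an image, is purely an assumption. Plain dictionaries stand in for the A-tree query results, and `rank_candidates` is a hypothetical name.

```python
def rank_candidates(results_per_index, top_k=3):
    """Merge per-index query results and rank images by accumulated weight.

    `results_per_index` holds one dict per feature vector, mapping an image
    id to its distance from the query in that feature space (assumed format).
    """
    weights = {}
    for results in results_per_index:
        for image_id, distance in results.items():
            # Assumed rule: closer matches add more weight, and an image
            # returned by several indexes accumulates their contributions.
            weights[image_id] = weights.get(image_id, 0.0) + 1.0 / (1.0 + distance)
    # The keys of `weights` are the union of the three result sets.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical results from three separate index queries.
query_results = [
    {"doc_a": 0.0, "doc_b": 1.0},
    {"doc_a": 1.0, "doc_c": 0.0},
    {"doc_b": 3.0, "doc_c": 1.0},
]
top = rank_candidates(query_results)
for image_id, weight in top:
    print(image_id, round(weight, 3))
```

Ties are broken by image id so that the ranking is deterministic.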
Keywords/Search Tags: document image retrieval, SIFT feature, density feature, key block feature, text area feature, non-text area feature, high-dimensional index structure