Research And Realization On Document Image Retrieval Of Non-plain Text Oriented

Posted on:2015-03-03

Degree:Master

Type:Thesis

Country:China

Candidate:J X Guo

GTID:2268330428480408

Subject:Computer application technology

Abstract/Summary:

With the rapid development of electronic information technology and the increasing speed of the Internet, image resources are growing massively, and more and more documents are stored as images. Besides plain text document images and pure form document images, many document images also contain tables or images. How to retrieve these non-plain text document images deserves in-depth study.

Existing retrieval techniques mainly extract text or character features from plain text document images, and these features are not applicable to document images that contain tables or images. Likewise, the features of pure form document images are not applicable to the text of documents. For document images that are dominated by text but also contain tables or images, a document image can be represented correctly only by making full use of both text and non-text features and combining them well.

This paper presents a method that combines document layout analysis, global features and local features to extract features and retrieve document images. For a variety of reasons, documents may contain noise or be tilted when stored as images, and this interference affects feature extraction, so the document images must be preprocessed first. After investigating document image preprocessing, this paper presents a method of binarization, denoising and tilt correction that prepares document images for feature extraction.

In feature extraction, the document images are analyzed and divided into plain text documents, documents containing tables and documents containing images. For plain text documents and the text parts of non-plain text documents, the global paragraph features and the local pixel features are extracted.
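
As an illustration of the binarization step in the preprocessing stage, the following is a minimal sketch only. The abstract does not say which binarization algorithm the thesis uses; Otsu's global threshold is assumed here because it is a common choice for scanned documents. The function names `otsu_threshold` and `binarize`, and the representation of an image as nested lists of grey levels in 0..255, are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises the between-class variance.

    Pixels with value <= t form one class (assumed to be ink),
    the remaining pixels form the other class (assumed to be paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels with value <= t
    sum_dark = 0  # sum of grey levels of those pixels
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue          # no pixels in the dark class yet
        w_light = total - w_dark
        if w_light == 0:
            break             # every pixel is already in the dark class
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a greyscale image (rows of 0..255 ints) to 1 = ink, 0 = background."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


# Illustrative 2x3 image: bright paper with two dark ink pixels.
sample = [[200, 210, 30],
          [220, 25, 205]]
binary = binarize(sample)
```

A global threshold is the simplest option; unevenly lit scans may need a locally adaptive threshold instead, which this sketch does not cover.
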
For documents containing tables, the relative spatial position features and the frame features of the table cells are extracted. For documents containing images, the relative spatial position features and the projection histogram features are extracted. The extracted features are combined into the comprehensive features of a document image, on which retrieval is based.

The features of the document images in the database are stored in the feature library that corresponds to each document type. At retrieval time, the query is matched against the image features in the library for its document type, and the most similar image is retrieved based on the distance between the two feature vectors.

In the experiments, three types of document images are retrieved: plain text documents, documents containing tables and documents containing images. The results are compared with retrieval methods for text documents and for form documents. The experimental results show that dividing the document images into types by layout analysis, extracting the global and local features for each type, and combining them into comprehensive features achieves higher accuracy in retrieving non-plain text document images.
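
The retrieval scheme described above, with one feature library per document type and ranking by distance between feature vectors, can be sketched roughly as follows. This is not the thesis's implementation: the abstract does not name the distance measure, so Euclidean distance is an assumption, and the class name `FeatureLibraries`, the type labels, and the two-element feature vectors are all hypothetical placeholders.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class FeatureLibraries:
    """One feature library per document type (hypothetical structure)."""

    def __init__(self):
        # Maps a document type label to a list of (image_id, feature_vector).
        self._libs = {}

    def add(self, doc_type, image_id, vector):
        """Store a document image's combined feature vector under its type."""
        self._libs.setdefault(doc_type, []).append((image_id, list(vector)))

    def query(self, doc_type, vector, k=1):
        """Return the k nearest stored images of the same type as (id, distance)."""
        scored = [(image_id, euclidean(stored, vector))
                  for image_id, stored in self._libs.get(doc_type, [])]
        scored.sort(key=lambda item: item[1])
        return scored[:k]


# Illustrative data only; real vectors would come from feature extraction.
libs = FeatureLibraries()
libs.add("table", "t1", [0.1, 0.9])
libs.add("table", "t2", [0.8, 0.2])
libs.add("text", "x1", [0.75, 0.25])

# The query is compared only against the "table" library, so "x1" is ignored
# even though its vector is identical to the query.
result = libs.query("table", [0.75, 0.25], k=1)
```

Restricting the search to the library for the query's document type keeps features of different kinds (for example table-cell frame features and projection histograms) from being compared with each other.
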

Keywords/Search Tags:

Document image retrieval, Image preprocessing, Layout analysis, Feature extraction

Related items

1	A Study Based On Layout Analysis Of Document Image Retrieval Algorithm
2	Research On Document Image Retrieval Technology Based On Combined Feature
3	Research Of Layout Structure-based Document Image Retrieval
4	Research On Document Image Layout Analysis And Text Extraction
5	Research On Logo Detection And Recognition In Document Images
6	Content-based Image Retrieval
7	Research On Document Retrieval Based On Image Content
8	Digital Library-Retrieval Of The Document Imaging
9	Document Image Layout Analysis Algorithms Based On Attention Mechanism
10	Visual And Textual Based Document Image Layout Analysis Methods