A Study of a Document Image Retrieval Algorithm Based on Layout Analysis

Posted on: 2012-03-21  Degree: Master  Type: Thesis
Country: China  Candidate: H Zhao  Full Text: PDF
GTID: 2178330332990408  Subject: Computer application technology
Abstract/Summary:
The information technology revolution brought about by computers and the Internet has produced a flood of electronic documents and has enabled high-speed information exchange, mass storage, and information retrieval; information can also be retained for longer and longer. Compared with traditional paper documents, electronic documents have a number of advantages: they take little storage space and are easy to retrieve, fast to transmit, and easy to update. Electronic documents can also be securely encoded to improve their reliability. Retrieving the images that contain the information users want has therefore become a focus of research.

Document images, which mainly contain text but also contain pictures and tables, differ from typical natural images. Document images usually take the form of working papers and are widely used, so document image retrieval has attracted wide attention. Document images are generated by scanning paper documents, and character recognition tools are then used to identify the useful information in them. With the emergence and maturing of OCR (Optical Character Recognition) technology, digital documents will become more widely usable and manageable, and document processing will become more efficient. Document image analysis is a very important part of a printed Chinese character recognition system and is considered as important as character recognition itself. OCR is a relatively advanced research area within the field of pattern recognition, and after several decades of development, layout analysis has become fairly mature. Layout analysis is the first step in automating an OCR system; the quality of its results directly affects the operation of the character recognition module and thereby the efficiency of the whole system.
Efficient layout analysis therefore plays a very important role in improving the quality of an OCR system. Specifically, layout analysis refers to the automated analysis, recognition, and understanding of a document image based on the graphics, image information, and structural relationships it contains.

Image retrieval consists of extracting image features and matching them: a distance measure is used to compare the similarity of images, the retrieved results are sorted in descending order of similarity, and the results are output to the user. Feature extraction and matching are the key to image retrieval. The layout features of a document image include headings, paragraphs, lines, and so on. Without applying costly OCR to recognize characters, and operating directly on the image data, we borrow from methods for analyzing image layout features and apply them to traditional content-based image retrieval. We propose extracting line features from the layout of the text area of a document image and using them to build an index, then performing image matching and similarity measurement, which yields a new retrieval algorithm based on new retrieval features. The algorithm has been applied to image matching and recognition, and the recognition results are comparatively good.

The method operates on text areas, so before feature extraction, document layout analysis is used to determine whether an image contains pictures, tables, or other non-text areas. A non-text area filter removes these areas and retains the text area. This limits the scope of application of the method. Moreover, the method does not address complex document layouts, such as layouts that contain horizontal text, vertical text, or a mixture of the two.
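
As a rough illustration of how line features might be obtained from a text area, the sketch below groups image rows that contain ink into text lines using a horizontal projection. The projection approach, the binary-image representation (1 = ink, 0 = background), and the name extract_text_lines are assumptions made for illustration; the abstract does not state how the lines are actually extracted.

```python
from typing import List, Tuple

# Hypothetical line feature: (x_center, y_center, length_in_pixels).
Line = Tuple[float, float, int]


def extract_text_lines(image: List[List[int]]) -> List[Line]:
    """Split a binary text area (1 = ink, 0 = background) into text lines.

    Consecutive rows that contain ink are grouped into one run, and each
    run is treated as one text line. This is an assumed, simplified
    stand-in for the layout analysis step, not the thesis's own method.
    """
    lines: List[Line] = []
    run_start = None
    # The extra empty row guarantees that a run touching the bottom edge is closed.
    for y, row in enumerate(image + [[0]]):
        has_ink = any(row)
        if has_ink and run_start is None:
            run_start = y
        elif not has_ink and run_start is not None:
            band = image[run_start:y]
            # Columns that contain ink anywhere inside this band of rows.
            cols = [x for r in band for x, v in enumerate(r) if v]
            left, right = min(cols), max(cols)
            length = right - left + 1
            lines.append(((left + right) / 2, (run_start + y - 1) / 2, length))
            run_start = None
    return lines
```

Each returned tuple gives the center and the length of one detected line, which is the kind of per-line information that an index of line features could be built from.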
Overall, the scope of application of the method is relatively narrow.

Matching is the critical technique in document image retrieval: its main task is to find the best match for the input image in the document database. Matching techniques are built on feature definition and extraction. Many distance measures can be used to judge the similarity between different layouts. In this paper, we use point pattern matching based on the line features. Each line is abstracted as a point whose gray value is defined as the length of the line. A weighted-average center point method is used to find the center of the image, and the relative coordinates of the points are then calculated. Difference energy is used to measure the similarity of images. However, point pattern matching has relatively high time complexity and requires further improvement.
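
The matching steps described above can be sketched as follows, under stated assumptions. Each text line is represented as a point (x, y, weight), with the weight standing in for the gray value, that is, the line length. The abstract does not define "difference energy", so the sketch assumes a sum of squared differences between each query point and its nearest candidate point; the function names to_relative, difference_energy, and rank are hypothetical.

```python
from typing import Dict, List, Tuple

# (x, y, weight): one point per text line; weight = line length (assumed).
Point = Tuple[float, float, float]


def to_relative(points: List[Point]) -> List[Point]:
    """Re-express points relative to their weighted-average center.

    Assumes a non-empty list with a positive total weight.
    """
    total = sum(w for _, _, w in points)
    cx = sum(x * w for x, _, w in points) / total
    cy = sum(y * w for _, y, w in points) / total
    return [(x - cx, y - cy, w) for x, y, w in points]


def difference_energy(query: List[Point], candidate: List[Point]) -> float:
    """Assumed energy: for each query point, the squared difference to the
    closest candidate point in (relative x, relative y, weight)."""
    q, c = to_relative(query), to_relative(candidate)
    energy = 0.0
    for qx, qy, qw in q:
        energy += min(
            (qx - x) ** 2 + (qy - y) ** 2 + (qw - w) ** 2 for x, y, w in c
        )
    return energy


def rank(query: List[Point], database: Dict[str, List[Point]]) -> List[Tuple[str, float]]:
    """Return (name, energy) pairs, lowest energy (most similar) first."""
    scores = [(name, difference_energy(query, pts)) for name, pts in database.items()]
    return sorted(scores, key=lambda s: s[1])
```

Because the coordinates are taken relative to the weighted center, a translated copy of the same layout yields an energy of approximately zero. The nearest-point search compares every query point with every candidate point, which is consistent with the relatively high time complexity noted above.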
Keywords/Search Tags: Layout feature, document image retrieval, feature extraction