
Document structure analysis and performance evaluation

Posted on: 2000-09-15
Degree: Ph.D.
Type: Dissertation
University: University of Washington
Candidate: Liang, Jisheng
Full Text: PDF
GTID: 1468390014960787
Subject: Engineering
Abstract/Summary:
The goal of document image structure analysis is to find an optimal solution partitioning the set of glyphs on a given document image into a hierarchical tree structure, where entities within the hierarchy are associated with their physical properties and semantic labels. In this dissertation, we present a unified, probability-based document image structure extraction algorithm. The probabilities are estimated from an extensive training set of various kinds of measurements of distances between the terminal and non-terminal entities with which the algorithm works. These probabilities, estimated off-line during training, then drive all decisions in the on-line segmentation module. An iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. This approach can be applied uniformly to the construction of the document hierarchy at any level. We have implemented a text line segmentation algorithm and a text block extraction algorithm using this framework. Another example is the development of a system that detects and recognizes special symbols (Greek letters, mathematical symbols, etc.) on technical document pages that are not handled by current Optical Character Recognition (OCR) systems.

A large quantity of ground-truth data, varying in quality, is required in order to give an accurate measurement of the performance of an algorithm under different conditions. We have constructed the University of Washington English Document Image Database-III (UW-III), which contains 1600 scanned scientific/technical document image pages that come with manually edited ground-truth of entity bounding boxes and properties. Based on the ground-truth data, we can evaluate the performance of document analysis algorithms and build statistical models to characterize various types of document image structures.
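As a rough illustration of the idea of letting estimated probabilities drive an iterative search for the most probable partition, the toy sketch below groups glyph positions along one axis. It is not the dissertation's algorithm: the function `link_probability`, its logistic form, the `scale` parameter, and the one-dimensional setup are all assumptions made for illustration, standing in for probabilities that would be estimated from training data.

```python
import math

def link_probability(gap, scale=5.0):
    """Hypothetical stand-in for a trained P(same group | gap between glyphs)."""
    return 1.0 / (1.0 + math.exp(gap - scale))

def log_joint(gaps, links):
    """Log of the joint probability of a labeling (link or break per gap)."""
    total = 0.0
    for gap, linked in zip(gaps, links):
        p = link_probability(gap)
        total += math.log(p if linked else 1.0 - p)
    return total

def segment(positions):
    """Iteratively flip link/break labels while the joint probability improves."""
    if not positions:
        return []
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    links = [False] * len(gaps)  # start with every glyph in its own group
    improved = True
    while improved:
        improved = False
        for i in range(len(links)):
            before = log_joint(gaps, links)
            links[i] = not links[i]          # tentatively flip one label
            if log_joint(gaps, links) > before:
                improved = True              # keep the flip
            else:
                links[i] = not links[i]      # undo the flip
    # Turn the final labels into groups of positions.
    groups = [[positions[0]]]
    for pos, linked in zip(positions[1:], links):
        if linked:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups

groups = segment([0, 2, 4, 20, 22])
```

In this toy model each label contributes independently to the joint probability, so the loop converges after one pass; a model with interacting terms would need more iterations, which is where a relaxation-style search becomes useful.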
In this dissertation, we present a set of quantitative performance metrics for each kind of information a document image analysis technique infers. The text line and text block extraction algorithms were trained and evaluated on the UW-III database using a cross-validation method. The text line extraction algorithm identifies and segments 99.76% of text lines correctly, while preliminary results for text block extraction show 91% accuracy.
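One plausible way to compute a "fraction of entities segmented correctly" figure from bounding-box ground truth is sketched below. The matching rule (a one-to-one match with a high mutual area overlap), the `threshold` value, and the function names are assumptions for illustration, not the metric definitions used in the dissertation.

```python
def area(box):
    """Area of a box given as (x0, y0, x1, y1); zero if the box is empty."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

def overlap(a, b):
    """Area of the intersection of two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))

def correct_fraction(truth, detected, threshold=0.9):
    """Fraction of ground-truth boxes matched one-to-one by a detected box.

    A ground-truth box counts as correct only if exactly one detected box
    overlaps it, that detected box overlaps no other ground-truth box, and
    the intersection covers at least `threshold` of both boxes.
    """
    if not truth:
        return 0.0
    correct = 0
    for i, g in enumerate(truth):
        matches = [d for d in detected if overlap(g, d) > 0]
        if len(matches) != 1:
            continue  # missed or split
        d = matches[0]
        if any(overlap(h, d) > 0 for j, h in enumerate(truth) if j != i):
            continue  # merged with another ground-truth entity
        inter = overlap(g, d)
        if inter >= threshold * area(g) and inter >= threshold * area(d):
            correct += 1
    return correct / len(truth)

truth = [(0, 0, 100, 10), (0, 20, 100, 30)]
split = [(0, 0, 100, 10), (0, 20, 50, 30), (50, 20, 100, 30)]
merged = [(0, 0, 100, 30)]
```

Here `correct_fraction(truth, split)` counts the first line as correct and the second as a split error, while `correct_fraction(truth, merged)` counts both lines as merge errors.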
Keywords/Search Tags:Document, Structure, Text block extraction, Text line, Performance