Document structure analysis and performance evaluation

Posted on:2000-09-15

Degree:Ph.D

Type:Dissertation

University:University of Washington

Candidate:Liang, Jisheng

GTID:1468390014960787

Subject:Engineering

Abstract/Summary:

The goal of document image structure analysis is to find an optimal solution partitioning the set of glyphs on a given document image into a hierarchical tree structure where entities within the hierarchy are associated with their physical properties and semantic labels. In this dissertation, we present a unified document image structure extraction algorithm that is probability based, where the probabilities are estimated from an extensive training set of various kinds of measurements of distances between the terminal and non-terminal entities with which the algorithm works. The off-line probabilities estimated in training then drive all decisions in the on-line segmentation module. An iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. This approach can be uniformly applied to the construction of the document hierarchy at any level. We have implemented a text line segmentation algorithm and a text block extraction algorithm using this framework. Another example is the development of a system that detects and recognizes special symbols (Greek letters, mathematical symbols, etc.) on technical document pages that are not handled by current Optical Character Recognition (OCR) systems. A large quantity of ground-truth data, varying in quality, is required in order to give an accurate measurement of the performance of an algorithm under different conditions. We have constructed the University of Washington English Document Image Database-III, which contains 1600 scanned scientific/technical document image pages that come with manually edited ground-truth of entity bounding boxes and properties. Based on the ground-truth data, we can evaluate the performance of document analysis algorithms and build statistical models to characterize various types of document image structures.
In this dissertation, we present a set of quantitative performance metrics for each kind of information a document image analysis technique infers. The text line and text block extraction algorithms were trained and evaluated on the UW-III database using a cross-validation method. The text line extraction algorithm identifies and segments 99.76% of text lines correctly, while the preliminary result of the text block extraction shows 91% accuracy.
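
As a rough illustration of how a figure such as "percentage of text lines segmented correctly" could be computed from ground-truth bounding boxes, the sketch below counts a ground-truth line as correct only when it overlaps exactly one detected box, that box overlaps no other ground-truth line, and the two boxes largely coincide. The matching rule, the 0.9 threshold, and all names (`Box`, `overlap`, `correct_detection_rate`) are assumptions made for this example; the abstract does not define the dissertation's actual metrics.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (zero if they are disjoint)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return area((x0, y0, x1, y1))


def correct_detection_rate(truth: List[Box], detected: List[Box],
                           threshold: float = 0.9) -> float:
    """Fraction of ground-truth boxes with a one-to-one detected match.

    The threshold is an assumed tolerance, not a value from the dissertation.
    """
    if not truth:
        return 0.0
    correct = 0
    for t in truth:
        # Detected boxes that touch this ground-truth box.
        hits = [d for d in detected if overlap(t, d) > 0]
        if len(hits) != 1:
            continue  # missed (no hit) or split (several hits)
        d = hits[0]
        # The detected box must not also cover another ground-truth box.
        covered = [u for u in truth if overlap(u, d) > 0]
        if len(covered) != 1:
            continue  # merge error
        inter = overlap(t, d)
        if inter / area(t) >= threshold and inter / area(d) >= threshold:
            correct += 1
    return correct / len(truth)


if __name__ == "__main__":
    truth = [(0, 0, 100, 10), (0, 20, 100, 30), (0, 40, 100, 50)]
    # The second line is split into two detected boxes, so it is not correct.
    detected = [(0, 0, 100, 10), (0, 20, 50, 30), (50, 20, 100, 30),
                (0, 40, 100, 50)]
    print(f"{correct_detection_rate(truth, detected):.4f}")
```

Splits and merges are treated as errors here, so the rate reflects segmentation quality and not only detection coverage.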

Keywords/Search Tags:

Document, Structure, Text block extraction, Text line, Performance

Related items

1	Research On Document Image Layout Analysis And Text Extraction
2	Text association mining with cross-sentence inference, structure-based document model and multi-relational text mining
3	Design And Implementation Of Web Document Extraction And Offline Collection System
4	Study On Method To Automatically Analyze The Text Structure Based On The Relevancy Computing Of Text Content
5	Learning-Based Text Extraction In Natural Background
6	Research On Document Retrieval Based On Index Optimization And Text Snippet Mechanism
7	Design And Implementation Of Enterprise Knowledge Document Retrieval Management System
8	Text understanding via semantic structure analysis
9	Design And Implementation Of Text Information Extraction On Smart Phone
10	Research Of Text Extraction Algorithm Based On Visual Semantic Block