Document analysis: Table structure understanding and zone content classification

Posted on: 2003-11-25
Degree: Ph.D
Type: Dissertation
University: University of Washington
Candidate: Wang, Yalin
Full Text: PDF
GTID: 1468390011480506
Subject: Engineering
Abstract/Summary:
For the last three decades, document image analysis researchers have successfully developed many methods for character recognition and page segmentation of text-based documents. Most of these methods were not designed to handle documents containing complex objects, such as tables. We develop a table structure understanding system that can detect and decompose table structures from document images. Our algorithm uses a background analysis technique to locate table candidates and then validates them using various measurements. An iterative optimization method is used to optimize the context probability. Our algorithm is probability based: the probabilities are estimated from an extensive training set of various kinds of measurements of distances between the terminal and non-terminal entities with which the algorithm works. The off-line probabilities estimated in training then drive all decisions in the on-line table structure understanding modules. We propose an experimental protocol that can simulate any given table ground truth with additional controlled variety. We present a table structure understanding performance evaluation protocol. Our algorithm reaches correct detection rates of 97.05% and 97.28% at the cell and table levels, respectively.

We propose a new machine learning based approach for genuine table detection from generic web documents. We design a novel web document table ground truthing protocol and use it to build a large table ground truth database. Experiments on this database demonstrate a significant performance improvement over a rule-based system.

Given segmented zone entities and a document image, zone content classification determines the zone types. Our zone content classification algorithms are evaluated on the University of Washington English Document Image Database-III. Using 25 features, we reach an accuracy rate of 98.45%.

We present a text word extraction algorithm that takes a set of bounding boxes of glyphs and their associated text lines of a given document and partitions the glyphs into a set of text words. Experiments on the University of Washington English Document Image Database-III show our algorithm is significantly better than two other competing algorithms.
Keywords/Search Tags: Document, Table structure understanding, Zone content, Algorithm