Document analysis: Table structure understanding and zone content classification

Posted on:2003-11-25

Degree:Ph.D

Type:Dissertation

University:University of Washington

Candidate:Wang, Yalin

GTID:1468390011480506

Subject:Engineering

Abstract/Summary:

For the last three decades, document image analysis researchers have successfully developed many methods for character recognition and page segmentation of text-based documents. Most of these methods were not designed to handle documents containing complex objects, such as tables. We develop a table structure understanding system that detects and decomposes table structures in document images. Our algorithm uses a background analysis technique to locate table candidates and then validates them using various measurements. An iterative optimization method is used to optimize the context probability. Our algorithm is probability based: the probabilities are estimated from an extensive training set of various kinds of measurements of distances between the terminal and non-terminal entities with which the algorithm works. The probabilities estimated off-line during training then drive all decisions in the on-line table structure understanding modules. We propose an experimental protocol that can simulate any given table ground truth with additional controlled variety. We present a table structure understanding performance evaluation protocol. Our algorithm reaches correct detection rates of 97.05% and 97.28% at the cell and table levels, respectively.

We propose a new machine learning based approach for genuine table detection from generic web documents. We design a novel web document table ground truthing protocol and use it to build a large table ground truth database. Experiments on this database demonstrate a significant performance improvement over a rule-based system.

Given segmented zone entities and a document image, zone content classification determines the zone types. Our zone content classification algorithms are evaluated on the University of Washington English Document Image Database-III. Using 25 features, we reach an accuracy rate of 98.45%.

We present a text word extraction algorithm that takes a set of bounding boxes of glyphs and their associated text lines of a given document and partitions the glyphs into a set of text words. Experiments on the University of Washington English Document Image Database-III show that our algorithm performs significantly better than two competing algorithms.
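The text word extraction task described above takes glyph bounding boxes grouped by text line and partitions them into words. The abstract does not say how the partition is computed, so the following is only a minimal sketch of the input and output shape, using a fixed horizontal-gap threshold as a stand-in for the dissertation's method. The `Box` type, the `group_glyphs_into_words` function, and the `gap_threshold` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned glyph bounding box in pixels (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int


def group_glyphs_into_words(line: List[Box], gap_threshold: int) -> List[List[Box]]:
    """Partition the glyphs of one text line into words.

    Assumption: a new word starts whenever the horizontal gap to the
    glyphs seen so far exceeds gap_threshold. This is a simple heuristic,
    not the algorithm evaluated in the dissertation.
    """
    words: List[List[Box]] = []
    prev_right: Optional[int] = None
    # Process glyphs left to right so gaps are measured between neighbours.
    for box in sorted(line, key=lambda b: b.left):
        if prev_right is None or box.left - prev_right > gap_threshold:
            words.append([box])      # gap is wide enough: start a new word
        else:
            words[-1].append(box)    # gap is small: extend the current word
        # Track the rightmost edge seen so far to handle overlapping glyphs.
        prev_right = box.right if prev_right is None else max(prev_right, box.right)
    return words


# Four glyphs with gaps of 2, 12, and 2 pixels between neighbours.
example_line = [
    Box(0, 0, 8, 10),
    Box(10, 0, 18, 10),
    Box(30, 0, 38, 10),
    Box(40, 0, 48, 10),
]
example_words = group_glyphs_into_words(example_line, gap_threshold=5)
```

With a threshold of 5 pixels, only the 12-pixel gap separates words, so the four glyphs fall into two words of two glyphs each. A fixed threshold is fragile across fonts and scan resolutions, which is one reason a method with parameters estimated from training data, as the abstract describes for the table system, may be preferable.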

Keywords/Search Tags:

Document, Table structure understanding, Zone content, Algorithm

Related items

1	Understanding the Logical and Semantic Structure of Large Document
2	Research And Implementation Of Micro-blog Sorting Algorithm Based On Network Structure And Content Understanding
3	Research On The Construction Method Of Streaming Document Corpus Oriented To Structure Understanding
4	Word Document Parsing And Content Desensitization Techniques
5	The Research And Implementation Of An Ontology-based Operation Document Understanding System
6	Research On Table Lookup In Content Centric Networking
7	Study On Zone Routing Based On DSR Protocol
8	Understanding the process of multi-document summarization: Content selection, rewriting and evaluation
9	Form File Identification And Understanding
10	Research And Implementation Of The Web Page Table Structure Recognition