
A syntactic approach to document segmentation and labeling

Posted on: 1991-11-02
Degree: Ph.D
Type: Dissertation
University: Rensselaer Polytechnic Institute
Candidate: Viswanathan, Mahesh
Full Text: PDF
GTID: 1478390017452496
Subject: Electrical engineering
Abstract/Summary:
The spatial structure of a document image is hierarchically identified and its various component blocks are labeled without using optical character recognition. A document image is a bit-map produced by raster-digitizing (scanning) a printed page from a technical journal. It may contain fields of text, formulas, tables and figures. Publication-specific knowledge is used in the segmentation and labeling of these blocks. This knowledge is coded in the form of block-grammars that describe the spatial relationships between various entity classes or blocks. The two-dimensional data in each block is converted to one-dimensional strings, so that string grammars may be applied. Segmenting and labeling these blocks can then be viewed as parsing these strings. Different grammars are applied to each segmentable document block until the desired sub-blocks are extracted. Starting with the horizontal direction for the whole document page, each level in the hierarchy is processed in a direction orthogonal to the previous level. The grammars are implemented using lexical and syntax analysis tools such as Lex and Yacc. These block-grammars can be meticulously coded by hand or generated from a parameter table.

Erroneous segmentations and labelings may be corrected, at any particular level and at higher levels, by using multiple grammars and backtracking. Finally, the maximum possible area of the document is labeled and presented. The method is applied to printed pages digitized at 300 dpi. Experimental results are shown for a training set of 41 pages and a test set of 24 pages from the IBM Journal of Research and Development and the IEEE Transactions on Pattern Analysis and Machine Intelligence.
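The conversion of a two-dimensional block into a one-dimensional string, followed by segmentation in alternating directions, can be illustrated with a minimal sketch. It is not the dissertation's implementation. It assumes that each row or column of the bit-map is reduced to a single symbol ("b" if it contains any black pixel, "w" otherwise), that a regular expression stands in for a Lex/Yacc block-grammar, and that a block ends wherever at least `min_gap` consecutive white symbols occur. The names `profile_string`, `segment`, `xy_cut` and `min_gap` are hypothetical, and the choice of which axis counts as the starting "horizontal direction" is left as a parameter because the abstract does not fix it. Labeling of the extracted blocks is not covered.

```python
import re


def profile_string(bitmap, axis):
    """Reduce a binary bitmap to a string of 'b'/'w' symbols.

    axis=0 yields one symbol per row, axis=1 one symbol per column.
    A symbol is 'b' if the row/column holds any black (nonzero) pixel.
    """
    lines = bitmap if axis == 0 else list(zip(*bitmap))
    return "".join("b" if any(line) else "w" for line in lines)


def segment(profile, min_gap):
    """Return (start, end) spans of blocks in the profile string.

    Assumed toy grammar: a block is a run of 'b' symbols that may
    contain white gaps shorter than min_gap (min_gap must be >= 1).
    """
    pattern = re.compile(r"b(?:w{0,%d}b)*" % (min_gap - 1))
    return [(m.start(), m.end()) for m in pattern.finditer(profile)]


def xy_cut(bitmap, top, left, axis, min_gap, depth):
    """Segment recursively, switching to the orthogonal axis at each level.

    Returns a list of (top, left, height, width) boxes for the blocks
    found at the deepest level processed.
    """
    boxes = []
    for start, end in segment(profile_string(bitmap, axis), min_gap):
        if axis == 0:
            sub = bitmap[start:end]
            sub_top, sub_left = top + start, left
        else:
            sub = [row[start:end] for row in bitmap]
            sub_top, sub_left = top, left + start
        if depth > 1:
            boxes.extend(
                xy_cut(sub, sub_top, sub_left, 1 - axis, min_gap, depth - 1)
            )
        else:
            boxes.append((sub_top, sub_left, len(sub), len(sub[0])))
    return boxes
```

Calling `xy_cut(page, 0, 0, axis=0, min_gap=2, depth=2)` on a small list-of-lists bitmap first cuts the page into bands of rows and then cuts each band into column blocks. In the approach the abstract describes, a publication-specific block-grammar would take the place of the single regular expression used here, and a failed parse would trigger another grammar or backtracking.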
Keywords/Search Tags:Document, Blocks