A syntactic approach to document segmentation and labeling

Posted on:1991-11-02

Degree:Ph.D

Type:Dissertation

University:Rensselaer Polytechnic Institute

Candidate:Viswanathan, Mahesh

GTID:1478390017452496

Subject:Electrical engineering

Abstract/Summary:

The spatial structure of a document image is hierarchically identified and its various component blocks are labeled without using optical character recognition. A document image is a bit-map produced by raster-digitizing (scanning) a printed page from a technical journal. It may contain fields of text, formulas, tables and figures. Publication-specific knowledge is used in the segmentation and labeling of these blocks. This knowledge is coded in the form of block-grammars that describe the spatial relationships between various entity classes or blocks. The two-dimensional data in each block is converted to one-dimensional strings, so that string grammars may be applied. Segmenting and labeling these blocks can then be viewed as parsing these strings. Different grammars are applied to each segmentable document block until the desired sub-blocks are extracted. Starting with the horizontal direction for the whole document page, each level in the hierarchy is processed in a direction orthogonal to the previous level. The grammars are implemented using lexical and syntax analysis tools such as Lex and Yacc. These block-grammars can be meticulously coded by hand or generated from a parameter table.

Erroneous segmentations and labelings may be corrected, at any particular level and at higher levels, by using multiple grammars and backtracking. Finally, the maximum possible area of the document is labeled and presented. The method is applied to printed pages digitized at 300 dpi. Experimental results are shown for a training set of 41 pages and a test set of 24 pages from the IBM Journal of Research and Development and the IEEE Transactions on Pattern Analysis and Machine Intelligence.
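The idea of turning a two-dimensional block into a one-dimensional string and then cutting in alternating directions can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not say how the strings are formed, so the projection-profile encoding (`b` for a scan line with ink, `w` for a blank one), the regular-expression tokenizer standing in for Lex, the `min_gap` threshold, the choice of rows as the first direction, and all function names are assumptions made for illustration. No block labeling is attempted here.

```python
import re

# Hypothetical token set: runs of inked lines ("b+") and runs of blank lines ("w+").
TOKEN = re.compile(r"b+|w+")


def profile_string(block, axis):
    """Collapse a binary block to a string, one symbol per row or column."""
    lines = block if axis == "rows" else list(zip(*block))
    return "".join("b" if any(line) else "w" for line in lines)


def segment(block, axis, min_gap=2):
    """Return (start, end) spans of ink separated by white runs >= min_gap."""
    spans, start, end = [], None, None
    for m in TOKEN.finditer(profile_string(block, axis)):
        if m.group()[0] == "b":
            if start is None:
                start = m.start()
            end = m.end()
        elif len(m.group()) >= min_gap and start is not None:
            spans.append((start, end))
            start = None
    if start is not None:
        spans.append((start, end))
    return spans


def cut(block, axis="rows", depth=2, min_gap=2):
    """Segment recursively, switching to the orthogonal direction at each level."""
    if depth == 0:
        return []
    other = "cols" if axis == "rows" else "rows"
    tree = []
    for a, b in segment(block, axis, min_gap):
        sub = block[a:b] if axis == "rows" else [row[a:b] for row in block]
        tree.append(((axis, a, b), cut(sub, other, depth - 1, min_gap)))
    return tree


# Toy 6x6 "page": two side-by-side blocks above one full-width block.
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
tree = cut(page)
```

On the toy page, the first pass finds two row bands, and the second pass splits the upper band into two column blocks. A real block-grammar would go further and assign a class label to each span according to publication-specific rules.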

Keywords/Search Tags:

Document, Blocks

Related items

1	A system for intelligent document image analysis, recognition and compression
2	Research And Implement Of The Computer-Aided Copy Detection System For Document
3	Research On Secure Transportation Of Data Blocks Of IP Protocol
4	Research And Implementation Of Network Multimedia Classroom Pure Software Model
5	Research On Theory And Development Technology Of CAD/CAM Integrated System For Hydraulic Manifold Blocks
6	Scientific Research Document Retrieval And Recommendation System Based On Doc2Vec
7	A Study Of Document Composite And Document Security For Ubiquitous Computing Mode
8	The Design And Implementation Of Document Flow System Based On J2EE And Workflow
9	All-Zero Blocks Detection And Its Applications In H.264/AVC Video Encoding
10	The Timing Of Document Flow Management System Analysis And Design