A multiscale domain-independent algorithm for document image segmentation

Posted on:2004-06-24

Degree:M.Sc

Type:Thesis

University:Queen's University (Canada)

Candidate:Chen, Sean Jy-Shyang

GTID:2468390011460650

Subject:Computer Science

Abstract/Summary:

Document image segmentation is a crucial step in converting paper document images into electronic documents. Entities in a document image, such as text blocks, tables, and figures, need to be separated before further document analysis and recognition can occur. Many document segmentation algorithms are designed exclusively for a few specific document types and rely on highly specialized document models.

This thesis presents a domain-independent segmenter that does not assume specific document layout models in its segmentation. The segmenter uses a minimal amount of image domain knowledge. Segmentation of graphic and text entities is based purely on their geometric attributes and tonal values. The segmenter extracts entities from the document images as non-overlapping sub-images.

The segmenter is a general-purpose tool that can be used for segmentation tasks where domain-specific models would be inappropriate, for example, for the purposes of image retrieval. The output of the segmenter can also be used to identify the domain of a document. Subsequently, an algorithm specific to that domain may be applied to the image to produce a refined segmentation. The segmenter can also act as a pre-segmenter that separates out document entities so that they can be resegmented by domain-specific segmenters. Due to its general nature, the segmenter can also be used for segmenting natural images. Results of segmentation are shown on a diverse set of test images.
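The abstract does not describe the multiscale algorithm itself, so the following is only a minimal sketch of the general idea of separating entities by tonal value and geometry, not the method of the thesis. It assumes a grayscale page given as a list of rows of 0-255 values; the function name `segment_by_tone` and the fixed `threshold` parameter are hypothetical choices made for illustration. Dark pixels are grouped into 4-connected regions, and each region is reported by its bounding box.

```python
from collections import deque


def segment_by_tone(image, threshold=128):
    """Group dark pixels into 4-connected regions and return their bounding boxes.

    Illustrative only: `image` is a list of rows of grayscale values (0-255),
    and `threshold` is an assumed fixed cut-off between ink and background.
    Each box is (top, left, bottom, right), with inclusive coordinates.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []

    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] >= threshold:
                continue  # background pixel, or already assigned to a region

            # Breadth-first flood fill from this unvisited dark pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and image[ny][nx] < threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))

            boxes.append((top, left, bottom, right))

    return boxes


if __name__ == "__main__":
    page = [
        [255, 255, 255, 255, 255, 255],
        [255,   0,   0, 255, 255, 255],
        [255,   0,   0, 255,  40, 255],
        [255, 255, 255, 255,  40, 255],
    ]
    print(segment_by_tone(page))
```

Bounding boxes of separate regions can overlap in general, so a segmenter that guarantees non-overlapping sub-images, as the abstract describes, would need an additional merging or splitting step that this sketch omits.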

Keywords/Search Tags:

Segmentation, Image, Document, Domain, Entities

Related items

1	Digital Watermarking Algorithm Based On The Image Feature Of Document Image
2	Research On The Automatic Acquisition Of Domain Entities
3	Image Segmentation In The Document Image Processing Applications
4	Research On Listed Company Announcement Document-level Event Extraction
5	Research And Application Of Document Image Paragraph Segmentation
6	Research On Extraction Of Web Data Entities Based On Domain Features
7	Research On Document Image Watermarking Based On Print-Scan Invariants And Double Domain
8	A tale of two paradigms: Disambiguating extracted entities with applications to a digital library and the Web
9	The Study On Subpixel Document Segmentation
10	The Research And Application Of Image Segmentation On Document Image Watermarking