The Research And Application Of Segmentation Method Between Image And Text In Layout Analysis

Posted on:2014-02-23

Degree:Master

Type:Thesis

Country:China

Candidate:Y Y Liu

Full Text:PDF

GTID:2248330395498636

Subject:Computer application technology

Abstract/Summary:

Optical character recognition (OCR) converts printed text into an optical image and then into editable digital information. For pages that mix images and text, layout analysis is needed before the OCR process. The task of layout analysis is to segment the document image into areas according to area type, to label the attributes of each area (its type, such as text, table, or image, its location, and so on), and to determine the order of the text areas.

Layout analysis is the precondition of OCR and document reconstruction, and its results directly affect the quality of both. Current layout analysis needs further improvement in accuracy, especially for complex Chinese page layouts that mix images and text, where inaccurate segmentation occurs. Layout analysis of such pages has two main problems:

(I) When irregular images are embedded in the page text, some words may be assigned to the image area, so text information is lost after recognition.
(II) Page headers and footers interfere with the logic and integrity of the body content after recognition.

This thesis studies these problems and puts forward solutions. Existing layout analysis methods were studied, especially methods for segmenting images from text and for identifying page headers and footers. The research results are as follows:

(I) A segmentation method between image and text based on neighborhood analysis. To address words being assigned to the bounding rectangle of an image, a segmentation method based on neighborhood analysis is adopted. It makes full use of the neighboring text information to detect probable text inside the image rectangle, segments that text by row (or column) to separate graphics from text, and then adjusts the text region.

(II) A page header and footer identification method based on division lines and region characteristics. To distinguish page headers and footers from the body, an identification method combining division lines and region characteristics is adopted. Experimental results showed the algorithm's effectiveness and generality.

This thesis derives from the National Key Technology R&D Program for the 11th Five-Year Plan, "Development of Reading Auxiliary Appliance for the Visually Impaired" (2009BAI71B02), whose goal is to develop a speech-enabled electronic reading product based on character recognition. The portable device captures printed information by taking pictures, then converts the picture information into speech for users through image processing, OCR, and speech synthesis. With the aid of the product, blind users can read books, magazines, and similar material. Layout analysis is an important part of the character recognition function, so improving its handling ability, including distinguishing headers and footers from the body and separating text from images properly to provide the full text information, is of great significance.
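
The abstract does not give the details of the neighborhood-analysis method, so the following is only a minimal sketch of one plausible reading, not the thesis's algorithm. It assumes a binarized image rectangle (1 = ink, 0 = background), uses a standard horizontal projection profile to split the rectangle into row bands, and keeps the bands whose height is close to the line height of the neighboring text. The function names `split_rows` and `probable_text_rows`, the `min_ink` threshold, and the `tolerance` parameter are all hypothetical.

```python
from typing import List, Tuple

Band = Tuple[int, int]  # (start_row, end_row), end exclusive


def split_rows(binary: List[List[int]], min_ink: int = 1) -> List[Band]:
    """Split a binarized region into row bands using a horizontal projection.

    A band is a maximal run of consecutive rows that each contain at
    least `min_ink` ink pixels (assumed threshold, not from the thesis).
    """
    profile = [sum(row) for row in binary]  # ink pixels per row
    bands: List[Band] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a band begins
        elif ink < min_ink and start is not None:
            bands.append((start, y))       # a band ends at a blank row
            start = None
    if start is not None:                  # band runs to the last row
        bands.append((start, len(profile)))
    return bands


def probable_text_rows(bands: List[Band],
                       neighbor_line_height: float,
                       tolerance: float = 0.5) -> List[Band]:
    """Keep bands whose height is close to the neighboring text line height.

    This is the assumed "neighborhood" cue: text that was swallowed by the
    image rectangle should look like the text lines next to it.
    """
    lo = neighbor_line_height * (1 - tolerance)
    hi = neighbor_line_height * (1 + tolerance)
    return [(s, e) for s, e in bands if lo <= e - s <= hi]


# Toy region: a 2-row text line, a blank row, a 6-row picture, a blank row.
region = ([[1, 1, 0, 1]] * 2 + [[0, 0, 0, 0]]
          + [[1, 1, 1, 1]] * 6 + [[0, 0, 0, 0]])
bands = split_rows(region)
text_bands = probable_text_rows(bands, neighbor_line_height=2)
```

In the toy region, the short band matches the neighboring line height and is kept as probable text, while the tall picture band is rejected. Column-wise segmentation for vertical text would apply the same idea to the transposed region.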

Keywords/Search Tags:

optical character recognition (OCR), layout analysis, segmentation between image and text, page header, page footer

Related items

1	Research Of Layout Analysis On Complex Chinese Document Images
2	Study On Web Page Rationality Of Universities’ Websites In Hebei Province
3	Research And Implementation On Key Technology Of Web Text Collection And Analysis
4	Research On Layout Analysis And Text Line Extraction Of Document Image
5	Stored In Corporate Competitive Intelligence, Intelligence Collecting Platform Based On Web-page Analysis
6	Research On Webpage Recognition Technology Based On Vision And Semantics
7	Research On Web Page Classification And Information Collection
8	Study On The Tag-based Analysis Technique Of Extracting The Body Of The Page
9	Study On Web Page Watermarking
10	The Research And Implementation Of One Kind Of Web Page Filtering Method Based On Real-Time Network Traffic Data