Extraction And Analysis Of Formula And Text In The Document Image With Complex Layout

Posted on:2016-01-30

Degree:Master

Type:Thesis

Country:China

Candidate:J Y He

Full Text:PDF

GTID:2348330488474362

Subject:Communication and Information System

Abstract/Summary:

With the advent of the information age, information is expected to be handled more intelligently and efficiently. Traditional document information is recorded mainly on paper; OCR technology digitizes paper documents, and layout analysis is a prerequisite for OCR. Because Chinese layout analysis has a short research history, and Chinese documents differ from English ones in character form and layout conventions, the study of Chinese layout analysis is of great importance. A Chinese layout contains four kinds of content: graphs, tables, formulas and text, of which formulas and text are the main parts. Because formulas resemble text in structure and composition, extracting formulas and text separately is difficult. This thesis studies complex Chinese document images in depth, and the specific work is as follows:

(1) Document image preprocessing. Boundary noise is first removed by projection profile analysis. Salt-and-pepper noise is then removed by combining connected-component labeling with median filtering; experiments comparing this with the traditional median filter confirm the validity of the combined method. Finally, the Hough transform is used to correct the skew of the document image.

(2) Formula and text preprocessing. When formulas and text are extracted from a document image, the layout structure (horizontal or vertical text direction, number of columns) and content such as titles and page numbers seriously affect the extraction results. To solve this problem, the thesis first combines the connected-domain method with the nearest-neighbor method to determine whether the text runs horizontally or vertically, then combines the projection method with morphological operations to extract titles and page numbers, and finally combines the projection method with domain extraction to determine the number of columns before formulas and text are located.
These preprocessing steps are the prerequisite for formula and text extraction.

(3) Formula localization in the document image. The thesis uses the projection method, the run-length smoothing algorithm, a connected-domain contour extraction algorithm and an improved rule-definition method to extract independent formulas from complex Chinese document layouts. Compared with traditional machine-learning methods, this approach extracts independent-line formulas from low-resolution images, with an accuracy of more than 80%.

(4) Text line extraction and merging in the document image. Connected-domain analysis is combined with the run-length smoothing algorithm to extract text lines from the document image, with an accuracy of more than 81%. The Sobel operator is then combined with morphological operations to merge the parts of a single text line.
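
The abstract names the run-length smoothing algorithm and the projection method but gives no implementation details. The following is only a minimal sketch of the textbook versions of those two techniques, assuming a binarized page stored as a NumPy array with ink pixels equal to 1. The function names rlsa_horizontal and line_bands and the max_gap threshold are hypothetical choices for illustration, not the thesis's own code or parameters.

```python
import numpy as np

def rlsa_horizontal(binary, max_gap):
    """Horizontal run-length smoothing: in each row, fill background runs
    of at most max_gap pixels that lie between two ink pixels."""
    out = binary.copy()
    for r in range(binary.shape[0]):
        cols = np.flatnonzero(binary[r])          # column indices of ink pixels
        for a, b in zip(cols[:-1], cols[1:]):
            if 1 < b - a <= max_gap + 1:          # gap length is b - a - 1
                out[r, a + 1:b] = 1
    return out

def line_bands(binary):
    """Horizontal projection: return (top, bottom) row ranges, bottom
    exclusive, for each maximal run of rows that contain ink."""
    has_ink = binary.sum(axis=1) > 0
    padded = np.concatenate(([False], has_ink, [False]))
    change = np.flatnonzero(padded[1:] != padded[:-1])  # alternating starts and ends
    return [(int(s), int(e)) for s, e in zip(change[0::2], change[1::2])]

# Toy page (assumed data): two "text lines" separated by blank rows.
page = np.zeros((7, 12), dtype=np.uint8)
page[1, [1, 2, 5, 6]] = 1   # two words with a 2-pixel gap
page[2, 1:4] = 1
page[5, [0, 9]] = 1         # two marks with an 8-pixel gap

smoothed = rlsa_horizontal(page, max_gap=3)
bands = line_bands(smoothed)
```

In this toy case the short gap in row 1 is bridged while the long gap in row 5 is left open, and the projection separates the page into two row bands. A real pipeline would tune max_gap to the character spacing of the scanned document.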

Keywords/Search Tags:

layout analysis, formula extraction, text extraction, contour extraction, mathematical morphology

Related items

1	The Study Of Mathematical Formula Extraction With The Script Identification
2	Research And Implementation Of Moving Object Contour Extraction Algorithm Based On GPU
3	The Extraction Of Mathematical Formulas In Word Documents For Math Retrieval
4	The Characters-Extraction Under Complex Color Background
5	Mathematical Formula Extraction In Printed-Chinese Documents Based On EEN Feature Function
6	Research On Feature Extraction Of Pipeline Defects Based On Mathematical Morphology
7	Research And Application On The Contour Extraction Of Moving Object In Video Surveillance
8	High-resolution Remote Sensing Images Of Urban Road Extraction Based On Mathematical Morphology
9	The Research Of Text-Region Extraction Based On Color Image
10	The Research On Formula Extraction In Digital Image