Extraction And Analysis Of Formula And Text In The Document Image With Complex Layout

Posted on: 2016-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: J Y He
GTID: 2348330488474362
Subject: Communication and Information System
Abstract/Summary:
With the advent of the information age, information processing has become more intelligent and efficient. Traditional document information is recorded primarily on paper; OCR technology digitizes paper documents, and layout analysis is a prerequisite for OCR. Because Chinese layout analysis has a short research history, and because Chinese documents differ from English ones in character form and layout conventions, the study of Chinese layout analysis is of great importance. A Chinese document layout comprises four kinds of content: graphs, tables, formulas, and text, of which formulas and text are the main parts. Because formulas resemble text in structure and composition, extracting formulas and text is difficult. This thesis studies Chinese document images with complex layouts in depth. The specific work is as follows:
(1) Document image preprocessing. First, boundary noise is removed by projection profile analysis. Then, salt-and-pepper noise is removed by combining connected-component labeling with a median filter; an experimental comparison with the traditional median filter confirms the validity of this method. Finally, the Hough transform is used to correct the skew of the document image.
(2) Formula and text preprocessing. When formulas and text are extracted from a document image, the layout structure (horizontal or vertical text direction, and the number of columns) and the content (titles, page numbers) seriously affect the extraction results. To solve this problem, this thesis first combines the connected-domain (connected-component) method with the nearest-neighbor method to determine whether the text runs horizontally or vertically, then uses the projection method combined with a morphological algorithm to extract titles and page numbers, and finally combines the projection method with domain extraction to determine the number of columns before formulas and text are located. These preprocessing steps are a prerequisite for formula and text extraction.
(3) Formula location in document images. This thesis uses the projection method, the run-length smoothing algorithm, connected-domain contour extraction, and an improved rule-definition method to extract independent formulas from complex Chinese document layouts. Compared with the traditional machine learning method, this method can extract independent-line formulas from low-resolution images, with an accuracy above 80%.
(4) Text line extraction and merging. This thesis combines connected-domain analysis with the run-length smoothing algorithm to extract text lines from the document image, with an accuracy above 81%. The Sobel operator is then combined with a morphological algorithm to merge the text of each line.
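To illustrate two of the techniques named in the abstract, the run-length smoothing algorithm and the projection method, here is a minimal sketch. It is not the thesis's implementation. It assumes a binary image stored as nested lists (1 = ink, 0 = background), and the function names and the threshold value are hypothetical choices made for this example. Horizontal smoothing fills short background gaps between ink pixels so that the characters of one line fuse together; a row projection of the result then yields candidate text-line bands.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink (foreground), 0 = background


def rlsa_horizontal(img: Image, threshold: int) -> Image:
    """Horizontal run-length smoothing.

    Any background run that lies between two ink pixels in the same row
    and is at most `threshold` pixels long is filled with ink.
    """
    out = [row[:] for row in img]
    for r, row in enumerate(img):
        last_ink = None  # column of the most recent ink pixel in this row
        for c, px in enumerate(row):
            if px == 1:
                if last_ink is not None and 0 < c - last_ink - 1 <= threshold:
                    for k in range(last_ink + 1, c):
                        out[r][k] = 1
                last_ink = c
    return out


def row_projection(img: Image) -> List[int]:
    """Horizontal projection profile: number of ink pixels in each row."""
    return [sum(row) for row in img]


def line_bands(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (top, bottom) row ranges whose profile is >= min_ink."""
    bands: List[Tuple[int, int]] = []
    start = None
    for r, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = r                      # a band begins
        elif value < min_ink and start is not None:
            bands.append((start, r - 1))   # the band ended on the previous row
            start = None
    if start is not None:                  # band reaching the bottom edge
        bands.append((start, len(profile) - 1))
    return bands


# Toy 6x10 "page": two rows of ink, a blank row, then one more row of ink.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

smoothed = rlsa_horizontal(page, threshold=2)  # threshold is an assumed value
profile = row_projection(smoothed)
bands = line_bands(profile)
print(profile)  # [0, 6, 6, 0, 5, 0]
print(bands)    # [(1, 2), (4, 4)]
```

In this toy example, gaps of up to two pixels are bridged while the three-pixel gap in rows 1 and 2 is left open. On real scans the threshold would depend on resolution and character spacing, and a vertical pass is often applied as well.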
Keywords/Search Tags: layout analysis, formula extraction, text extraction, contour extraction, mathematical morphology