Extraction, Recognition And Reconstruction Of Mathematics Formulas In English Scientific Document

Posted on:2008-01-19

Degree:Doctor

Type:Dissertation

Country:China

Candidate:F Li

Full Text:PDF

GTID:1118360218453555

Subject:Computational Mathematics

Abstract/Summary:

As computer storage capacity has grown, more and more documents are scanned and saved as bitmap images, and there is growing interest in converting these document images into a retrievable and editable form. Document image analysis (DIA) arose to address this task. Optical character recognition (OCR), which handles both printed and handwritten documents, is the core of DIA. At present, recognition rates for printed documents are high, and OCR of printed documents is widely applied in fields such as office automation and digital libraries. However, scientific documents contain large numbers of mathematical formulas. These formulas usually include Greek characters and other special symbols, and their symbols often stand in two-dimensional positional relationships to one another. Traditional OCR products cannot handle formula images with two-dimensional structure, so at present the only way to reuse the formulas in a scientific document is to re-enter them by hand. To this end, this thesis presents an OCR system that understands formulas in binary printed document images. The contents of the thesis are as follows.

Chapter 1 reviews the history and related technology of DIA and mathematical formula extraction, and discusses the advantages and disadvantages of existing algorithms. It also outlines the structure of the novel system, which automatically extracts the mathematical formulas from document images and recognizes the characters in each formula. After analyzing the structure of each formula with an LL(1) grammar, the system outputs LaTeX for the formulas in the original document image.

Chapter 2 presents a definition of the local maximum component (the component for short) for DIA and describes the algorithm for the component labeling.
The novel algorithm uses a contour tracing technique to detect and label the external contour of each component, and removes the interior area of each component from a copy of the source image. The labeling and the removal are completed in a single pass over the source image. Chapter 2 also reports experiments comparing the novel algorithm with traditional ones.

Chapter 3 uses the novel component labeling algorithm in a new method for extracting mathematical formulas from English scientific document images. First, a benchmark parameter is calculated from statistics of the whole document image. Second, the document image is divided into lines using the horizontal projection of the components in the image, and each line is divided into sub-regions using the vertical projection. These sub-regions are classified according to the benchmark parameter. Finally, the locations of the formulas in the document image are obtained by suitably merging certain specific regions. The method can be applied to documents that mix pictures and text, and it reduces the effect of pictures and forms in the document image on the localization of mathematical expressions.

Chapter 4 introduces the mathematical formula recognition and reconstruction algorithms adopted in the system. Character features are extracted with Zernike moments, and a multi-classifier composed of SOFM and BP neural networks is then used for symbol recognition. To segment merged characters in the image, a segmentation algorithm based on a modified SOFM neural network is introduced into the system. Chapter 4 also introduces a formula structure analysis method based on LL(1) grammar, with which the system converts the recognition results into a LaTeX format string.

At the end of this thesis, the remaining problems in the system are analyzed.
A discussion of the extensibility of the novel system is also included.
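
The projection-based splitting described for Chapter 3 can be illustrated with a minimal sketch. It is not the thesis's implementation: it projects raw foreground pixels rather than labeled components, it omits the benchmark parameter and the classification and merging steps (the abstract does not define them), and the names `projection`, `runs`, `split_lines_and_regions` and the `min_col_gap` threshold are hypothetical. A binary image is assumed to be a list of rows of 0/1 values.

```python
def projection(image, axis):
    """Count foreground pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def runs(profile, min_gap=1):
    """Return (start, end) spans of non-empty stretches, end exclusive.

    Stretches separated by fewer than min_gap empty positions are merged.
    """
    spans = []
    start = None
    gap = 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the span just after its last non-empty position.
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, len(profile) - gap))
    return spans


def split_lines_and_regions(image, min_col_gap=2):
    """Split an image into text lines, then each line into sub-regions.

    Returns (top, bottom, left, right) boxes with exclusive bottom/right.
    min_col_gap is an assumed threshold, not a value from the thesis.
    """
    regions = []
    for top, bottom in runs(projection(image, axis=0)):
        line = image[top:bottom]
        for left, right in runs(projection(line, axis=1), min_col_gap):
            regions.append((top, bottom, left, right))
    return regions


# A tiny 4x7 binary "page": two text lines separated by one blank row.
page = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 0, 0],
]
regions = split_lines_and_regions(page)
```

In this sketch a single blank column inside a line does not split a region, because `min_col_gap` is 2. In a full system each resulting box would then be classified as text or formula and neighboring boxes merged, which is left out here.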

Keywords/Search Tags:

Optical character recognition, local maximum component labeling, formula extraction, merged characters segmentation, formula reconstruction

Related items

1	Research On Technology Of Optical Formula Recognition
2	Research On Key Issues Of Printed Mathematical Formula Recognition
3	A Method For Detecting Merged Subscripts In English Scientific Document
4	Basic Optical Formula Recognition Technology Research
5	Research And Implementation On Printed Mathematical Formula Recognition
6	Extraction Of Mathematics Formulas In Chinese Scientific Document
7	Research And Implementation On Detection And Recognition Algorithm For Mathematics Formulas In Documents
8	The Study On The Method Of Matrix Structural Analysis In Printed Mathematical Formula
9	The Study Of Mathematical Formula Extraction With The Script Identification
10	System Of Mathematical Formula Recognition In Printed Chinese Documents