Extraction Of Mathematics Formulas In Chinese Scientific Document

Posted on:2008-07-24

Degree:Master

Type:Thesis

Country:China

Candidate:B N Sang

GTID:2178360218455285

Subject:Computational Mathematics

Abstract/Summary:

With the development of computers and the internet, more and more data are being stored on computers in the form of document images, and the internet has become the major medium for storing, searching, and distributing information. How to convert those document images into an editable form quickly and efficiently is a problem that urgently needs to be solved, and Document Image Analysis (DIA) has arisen as a new academic research field to address it.

Optical character recognition (OCR) is the core technique of DIA. Existing OCR systems achieve a very high recognition rate. However, because mathematical formulas have a two-dimensional (planar) structure, simply enlarging the symbol library is not enough to capture all of the information in the image.

How to extract, recognize, and reconstruct mathematical formulas in printed scientific documents is still under discussion. Although many algorithms have been developed, most of them can only be used in an English context. Because English letters and Chinese characters differ in their component structure, transferring these algorithms directly to a Chinese context generates numerous errors and also fails to exploit the distinctive traits of Chinese documents.

Chapter 1 reviews the history of, and the technology related to, document image analysis, pattern recognition, and artificial neural networks.

Since the recognition of mathematical symbols is necessary for our formula extraction algorithm, in Chapter 2 the features of characters are extracted using Zernike moments, and a multi-classifier composed of SOFM and BP neural networks is then adopted for symbol recognition.

Chapter 3 presents the problems that occur when one tries to extract formulas from Chinese scientific documents, by reviewing some existing formula extraction algorithms designed for an English context. The topics discussed are component labeling, the traits of Chinese document typesetting, human reading habits, and the locality of the distribution of mathematical formulas.

Based on these discussions, a new algorithm is proposed in which a group of standard input panes is used to judge whether the input image is a Chinese character, so that component labeling can be skipped. Exploiting the locality of the distribution of mathematical formulas, different methods are applied depending on the local density of formulas, which speeds up the algorithm overall. Many specific problems, such as defining the standard input pane, confirming that the input image is a Chinese character, and handling the minor differences caused by Chinese typesetting, are considered in this chapter.

At the end of the thesis, the remaining problems in the system are analyzed. A discussion of the extensibility of the new algorithm is also included.
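As an illustration of the Zernike-moment features mentioned for Chapter 2, the following is a minimal sketch using the textbook definition of Zernike moments. It is not taken from the thesis. The function names, the mapping of the image onto the unit disc, and the pixel-area normalization are all assumptions made for this example.

```python
import math
import numpy as np


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho), textbook definition."""
    m = abs(m)
    out = np.zeros_like(rho, dtype=float)
    for s in range((n - m) // 2 + 1):
        coef = ((-1) ** s * math.factorial(n - s)
                / (math.factorial(s)
                   * math.factorial((n + m) // 2 - s)
                   * math.factorial((n - m) // 2 - s)))
        out += coef * rho ** (n - 2 * s)
    return out


def zernike_moment(img, n, m):
    """Complex Zernike moment Z_nm of a 2-D grayscale or binary image.

    Assumption: the image is scaled so that its pixel grid spans
    [-1, 1] x [-1, 1]; pixels outside the unit disc are ignored.
    """
    if abs(m) > n or (n - abs(m)) % 2 != 0:
        raise ValueError("need |m| <= n and n - |m| even")
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x = (2 * xs - (w - 1)) / (w - 1)
    y = (2 * ys - (h - 1)) / (h - 1)
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    mask = rho <= 1.0
    # Complex conjugate of the basis function V_nm = R_nm * exp(j*m*theta).
    basis = radial_poly(n, m, rho) * np.exp(-1j * m * theta)
    # Discrete approximation of the integral over the unit disc.
    pixel_area = (2.0 / (w - 1)) * (2.0 / (h - 1))
    return (n + 1) / np.pi * np.sum(img[mask] * basis[mask]) * pixel_area


def zernike_features(img, max_order=4):
    """Moment magnitudes |Z_nm| for m >= 0 up to max_order."""
    return np.array([abs(zernike_moment(img, n, m))
                     for n in range(max_order + 1)
                     for m in range(n % 2, n + 1, 2)])
```

The magnitudes |Z_nm| do not change when the symbol is rotated, which is a common reason for choosing Zernike moments as symbol features. A feature vector such as the one returned by `zernike_features` could then be passed to a classifier. The SOFM and BP networks named in the abstract are not sketched here.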

Keywords/Search Tags:

Document image analysis, formula extraction, Chinese document context, component labeling, formula distribution locality

Related items

1	Extraction, Recognition And Reconstruction Of Mathematics Formulas In English Scientific Document
2	Mathematical Formula Extraction In Printed-Chinese Documents Based On EEN Feature Function
3	Mathematical Formula Feature Extraction And Locating In Chinese Scanned Printed Document
4	Mathematical Formula Locating In Chinese Image Document
5	Research On The Mathematical Formula Recognition Technology For Printed Document
6	Extraction And Analysis Of Formula And Text In The Document Image With Complex Layout
7	The Study Of Mathematical Formula Extraction With The Script Identification
8	An English Scientific Document Retrieval Method Based On Formula Description Structure And Word Embedding
9	System Of Mathematical Formula Recognition In Printed Chinese Documents
10	Similarity Computing Of Scientific And Technical Documents Based On Texts And Formulas