
Extraction Of Mathematics Formulas In Chinese Scientific Document

Posted on: 2008-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: B N Sang
Full Text: PDF
GTID: 2178360218455285
Subject: Computational Mathematics
Abstract/Summary:
With the development of computers and the internet, more and more data are stored on computers in the form of document images, and the internet has become the major medium for storing, searching and spreading information. Converting those document images into an editable form quickly and efficiently is a problem that urgently needs solving, and Document Image Analysis (DIA) has emerged as a new academic research field to address it.

Optical character recognition (OCR) is the core technique of DIA. Existing OCR systems achieve a very high recognition rate. However, because mathematics formulas have a planar (two-dimensional) structure, recognizing them merely by enlarging the symbol library does not capture the image information completely.

How to extract, recognize and reconstruct mathematics formulas in printed scientific documents is still an open question. While many algorithms have been developed, most of them can only be used in an English context. Because English letters and Chinese characters differ in their component structure, transferring these algorithms directly to a Chinese context generates numerous errors and also fails to exploit the traits of Chinese documents.

Chapter 1 reviews the history of, and the technology related to, document image analysis, pattern recognition, and artificial neural networks.

Since recognizing mathematics symbols is necessary for our formula extraction algorithm, in Chapter 2 the features of characters are extracted with Zernike moments, and a multi-classifier composed of SOFM and BP neural networks is then adopted for symbol recognition.

Chapter 3 presents the problems that occur when one tries to extract formulas from Chinese scientific documents, by reviewing some existing formula extraction algorithms designed for an English context. It discusses component labeling, the traits of Chinese document typesetting, human reading habits, and the locality of the distribution of mathematics formulas.

Based on these discussions, a new algorithm is proposed in which a group of standard input panes is used to judge whether the input image is a Chinese character, so that component labeling can be skipped. Exploiting the locality of the distribution of mathematics formulas, different methods are adopted according to the distribution intensity in order to speed up the algorithm overall. Many specific problems, such as defining the standard input pane, confirming that the input image is a Chinese character, and the minor differences caused by Chinese typesetting, are considered in this chapter.

At the end of the thesis, the remaining problems in the system are analyzed. A discussion of the extensibility of the new algorithm is also included.
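
The abstract names Zernike moments as the character features but gives no orders, normalization, or image size. As an illustration only, the following minimal sketch computes the standard textbook Zernike moment Z_nm of a square grayscale image mapped onto the unit disc. The function names `radial_poly` and `zernike_moment`, the pixel-centre mapping, and the list-of-rows image format are assumptions made for this example, not details taken from the thesis.

```python
import cmath
import math


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho); requires |m| <= n and n - |m| even."""
    m = abs(m)
    total = 0.0
    for s in range((n - m) // 2 + 1):
        coeff = ((-1) ** s * math.factorial(n - s)
                 / (math.factorial(s)
                    * math.factorial((n + m) // 2 - s)
                    * math.factorial((n - m) // 2 - s)))
        total += coeff * rho ** (n - 2 * s)
    return total


def zernike_moment(image, n, m):
    """Discrete approximation of Z_nm for a square image given as a list of rows."""
    if abs(m) > n or (n - abs(m)) % 2 != 0:
        raise ValueError("need |m| <= n and n - |m| even")
    size = len(image)
    acc = 0j
    for row in range(size):
        for col in range(size):
            # Assumed mapping: pixel centres are placed inside [-1, 1] x [-1, 1].
            x = (2 * col + 1 - size) / size
            y = (2 * row + 1 - size) / size
            rho = math.hypot(x, y)
            if rho > 1.0:
                continue  # only pixels inside the unit disc contribute
            theta = math.atan2(y, x)
            # Multiply by the complex conjugate of V_nm = R_nm(rho) * exp(j*m*theta).
            acc += image[row][col] * radial_poly(n, m, rho) * cmath.exp(-1j * m * theta)
    pixel_area = (2.0 / size) ** 2  # area of one pixel in normalized coordinates
    return (n + 1) / math.pi * acc * pixel_area
```

The magnitude `abs(zernike_moment(image, n, m))` is unchanged when the image is rotated, which is the usual reason for choosing these moments as symbol features; a classifier would typically receive a vector of such magnitudes for several (n, m) pairs.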
Keywords/Search Tags: Document image analysis, formula extraction, Chinese document context, component labeling, formula distribution locality