Research on the Extraction of Mathematical Formulas on the Web

Posted on: 2013-05-06
Degree: Master
Type: Thesis
Country: China
Candidate: L W Cui
Full Text: PDF
GTID: 2248330371987102
Subject: Software engineering
Abstract/Summary:
With the development of information technology, particularly the spread of computers and the growth of Internet applications, users increasingly rely on the Internet, and more of their daily activities, work, and study take place online. Mathematical formulas are an indispensable tool in these activities and are used ever more widely on the Web, so math-related content on the Web is growing quickly. As Web support for mathematics matures, users publish, retrieve, and manage mathematical formulas on the Web, which requires search engine support for mathematical formulas. The mathematics search engine is one of the research topics of third-generation intelligent search engines. A formula-based crawler is a very important part of a mathematics search engine, and its quality directly affects the functionality and performance of the search engine.
This paper focuses on formula-based crawlers, mainly the extraction and recognition of mathematical formulas from the Web and the design of the system. At present, recognition of mathematical formulas from images has made considerable progress, but recognition from Web pages remains difficult, and current recognition techniques cannot be applied to the exchange and search of mathematical formulas. This paper therefore studies formula recognition specifically, focusing on extracting formulas in LaTeX format, XML format, infix format, Office-series formula formats, and others from Web documents. By analyzing the characteristics of these formulas, this paper proposes a formula recognition method based on pattern characteristics and heuristic rules.
Building on this research, MathCrawler is implemented on top of the open-source Nutch system. It has a sound system architecture, can acquire documents that contain mathematical formulas, and can extract the formulas from those documents. Experiments show that the system performs well and extracts mathematical formulas with relatively high accuracy.
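As a rough illustration of what recognition "based on pattern characteristics and heuristic rules" could look like for two of the formats named above (LaTeX and XML-style MathML), the following Python sketch may help. It is not the method of the thesis, whose actual rules are not given in this abstract. The patterns, the `looks_like_math` heuristic, and the function name `extract_formulas` are all assumptions made for illustration, and infix and Office formats are not covered.

```python
import re

# All patterns below are illustrative assumptions, not the thesis's rules.
MATHML = re.compile(r"<math\b[^>]*>.*?</math>", re.DOTALL | re.IGNORECASE)
LATEX_DELIMITED = [
    re.compile(r"\$\$(.+?)\$\$", re.DOTALL),  # display math: $$ ... $$
    re.compile(r"\\\[(.+?)\\\]", re.DOTALL),  # display math: \[ ... \]
    re.compile(r"\\\((.+?)\\\)", re.DOTALL),  # inline math:  \( ... \)
]
INLINE_DOLLAR = re.compile(r"\$([^$\n]+?)\$")  # inline math: $ ... $
# Heuristic rule: real formulas usually contain a LaTeX command or one of ^ _ = { }.
MATH_HINT = re.compile(r"\\[A-Za-z]+|[\^_={}]")


def looks_like_math(body):
    """Heuristic filter that rejects spans such as '$5 and $10' (currency)."""
    return MATH_HINT.search(body) is not None


def _collect(pattern, kind, group, text, found):
    """Record every match of `pattern`, then blank it out of the text."""
    def repl(match):
        found.append((kind, match.group(group).strip()))
        return " "
    return pattern.sub(repl, text)


def extract_formulas(text):
    """Return a list of (kind, formula) pairs found in `text`."""
    found = []
    # 1. Unambiguous delimiters first: MathML elements, then LaTeX delimiters.
    text = _collect(MATHML, "mathml", 0, text, found)
    for pattern in LATEX_DELIMITED:
        text = _collect(pattern, "latex", 1, text, found)
    # 2. Ambiguous single-dollar spans are accepted only if they pass the heuristic.
    pos = 0
    while True:
        match = INLINE_DOLLAR.search(text, pos)
        if match is None:
            break
        if looks_like_math(match.group(1)):
            found.append(("latex", match.group(1).strip()))
            pos = match.end()
        else:
            pos = match.start() + 1  # retry, treating the next '$' as an opener
    return found


if __name__ == "__main__":
    sample = (r"Price is $5 and $10. Euler: $e^{i\pi} + 1 = 0$ and "
              r"<math><mi>x</mi></math> plus $$\int_0^1 x\,dx$$")
    for kind, formula in extract_formulas(sample):
        print(kind, formula)
```

The sketch handles unambiguous delimiters by pattern alone and applies the heuristic only to single-dollar spans, where ordinary text such as prices can mimic formula delimiters. A production crawler would more likely parse the HTML tree than run regular expressions over raw text.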
Keywords/Search Tags: Search Engine, Crawler, Formula Extraction, Mathematical Formulas, MathML, OpenMath