Research on HTML and PDF Information Extraction Technology Based on XML

Posted on: 2007-12-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Song
Full Text: PDF
GTID: 2178360182473261
Subject: Computer software and theory
Abstract/Summary:
As society moves into the information age, users are confronted with large quantities of data, most of which are stored in formats such as HTML and PDF. HTML and PDF are good at describing how data is displayed but poor at revealing its content, which has become a major obstacle to information retrieval. XML, the data-exchange standard proposed by the W3C, is content-oriented and therefore overcomes this weakness of HTML and PDF. It is thus urgent to extract information from HTML and PDF documents and convert it into XML documents. This forms the research background of this thesis.

The rule-based approach is the most popular method for information extraction (IE). First, this thesis puts forward an approach to the IE problem that extracts information using the standard XSLT and XPath technologies, which have advantages in data location and document conversion. Second, it summarizes the basic theory and standards relating to XML-based IE. To generate simple, robust and general extraction rules, it also studies the optimization of XSLT extraction rules. Building on this work, the thesis focuses on implementing information extraction from HTML and PDF documents.

The core work of this thesis is the development of an XML-based PDF information extraction system. Its distinguishing feature is the choice of XML as the model for representing information and XSLT as the language for extraction rules. The fundamental idea is to first convert the PDF document into an intermediate ("middle") XML document, and then apply XSLT rules to that middle document according to its description of text, location and display. The system contains three modules. The first, the Middle-Document Producing Module, outputs an XML-formatted middle document that describes the display style and layout structure of the PDF document. The second, the Rules Producing Module, generates XSLT rules in a semi-automatic manner. The last, the Automatic Extraction Module, transforms the middle document into a self-describing, semi-structured XML document by using the XSLT document as extraction rules. The system is of great importance for semantics-based retrieval and management of PDF documents. Furthermore, the architecture of the system and the design of its main components are also valuable for other IE systems.
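
The two-stage idea described in the abstract (produce a middle XML document, then apply location- and display-based rules to it) can be illustrated with a small sketch. This is not the system developed in the thesis. The middle-document schema (`document`, `page` and `text` elements with `x`, `y`, `font` and `size` attributes), the sample values and the rule table are all assumptions made for illustration. Python's standard-library ElementTree, which supports only a limited XPath subset, stands in here for a real XSLT processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical middle document: each <text> node records the text of one
# PDF fragment plus its position and display style. The schema is assumed.
MIDDLE_DOC = """\
<document>
  <page number="1">
    <text x="72" y="90" font="Times-Bold" size="18">Research on PDF Information Extraction</text>
    <text x="72" y="120" font="Times-Italic" size="11">Y J Song</text>
    <text x="72" y="160" font="Times-Roman" size="10">Users are confronted with large quantities of data.</text>
  </page>
</document>
"""

# Hypothetical extraction rules: (output element name, XPath locator).
# In an XSLT-based system each pair would roughly correspond to a template
# that matches middle-document nodes and emits a semantic element.
RULES = [
    ("title", ".//text[@size='18']"),
    ("author", ".//text[@font='Times-Italic']"),
    ("body", ".//text[@size='10']"),
]


def extract(middle_xml, rules):
    """Apply each rule to the middle document and build a semantic XML record."""
    root = ET.fromstring(middle_xml)
    record = ET.Element("record")
    for tag, path in rules:
        # Locate nodes by display features, then re-label them by meaning.
        for node in root.findall(path):
            out = ET.SubElement(record, tag)
            out.text = (node.text or "").strip()
    return record


result = extract(MIDDLE_DOC, RULES)
print(ET.tostring(result, encoding="unicode"))
```

The output is a self-describing record in which the element names carry the meaning (`title`, `author`, `body`) that the display-oriented middle document only implied through font and position. A full XSLT rule set could additionally restructure, sort or merge nodes, which this sketch does not attempt.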
Keywords/Search Tags: Information Extraction, XML, PDF, HTML