Research On The HTML And PDF Information Extraction Technology Based XML

Posted on:2007-12-30

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Song

Full Text:PDF

GTID:2178360182473261

Subject:Computer software and theory

Abstract/Summary:

As society moves into the information age, users are confronted with large quantities of data, most of which are stored in formats such as HTML and PDF. HTML and PDF are good at describing how data is displayed but poor at exposing its content, which has become a major obstacle to information retrieval. XML, the data-exchange standard proposed by the W3C, is content-oriented and therefore overcomes this disadvantage of HTML and PDF. It is thus pressing to extract information from HTML and PDF documents and convert it into XML documents, and this forms the background of the research. The rule-based approach is the most popular method for information extraction (IE). First, this thesis puts forward an approach to the IE problem that extracts information using the standard XSLT and XPath technologies, which have advantages in data location and document conversion. Second, it summarizes the basic theory and standards relating to XML-based IE. To generate simple, robust and general extraction rules, it also studies the optimization of XSLT extraction rules. Building on this work, the thesis focuses on implementing information extraction from HTML and PDF documents.

The core work of this thesis is the development of an XML-based PDF information extraction system. Its distinguishing feature is the choice of XML as the model for representing information and XSLT as the extraction rule language. The basic idea is to first convert the PDF document into an intermediate XML document, and then apply XSLT rules to the intermediate document according to its description of text, location and display style. The system contains three modules. The first is the intermediate-document generation module, which produces an XML-formatted intermediate document describing the display style and layout structure of the PDF document. The second is the rule generation module, which generates XSLT rules in a semi-automatic manner. The last is the automatic extraction module, which transforms the intermediate document into a self-describing, semi-structured XML document by using the XSLT document as the extraction rules. The system is significant for semantics-based retrieval and management of PDF documents. Furthermore, the system architecture and the design of its main components are also of value for other IE systems.
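
The two-step idea in the abstract (an intermediate, or "middle", XML document that records text, position and display style, followed by rules that select content from it) can be illustrated with a minimal sketch. This is not the thesis's system: the intermediate schema (`document`, `page`, `text`, and the `x`, `y`, `size`, `style` attributes), the `RULES` table and the `extract` function are all hypothetical, and the limited XPath subset of Python's standard `xml.etree.ElementTree` stands in for full XSLT rules so that the example needs no third-party library.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate ("middle") document: each <text> element keeps the
# position and display style of one text run from a PDF page.
# The element and attribute names are assumptions made for this sketch.
INTERMEDIATE = """
<document>
  <page number="1">
    <text x="72" y="90" size="18" style="bold">Sample Report Title</text>
    <text x="72" y="120" size="11" style="italic">A. Author</text>
    <text x="72" y="160" size="11" style="plain">First paragraph of body text.</text>
    <text x="72" y="180" size="11" style="plain">Second paragraph of body text.</text>
  </page>
</document>
"""

# Hypothetical extraction rules: output element name -> path expression.
# Each rule locates content by its display features, as an XSLT template
# with an XPath match pattern would.
RULES = {
    "title": "./page/text[@size='18']",
    "author": "./page/text[@style='italic']",
    "paragraph": "./page/text[@style='plain']",
}


def extract(intermediate_xml, rules):
    """Apply the rules to the intermediate document and build a content-oriented record."""
    root = ET.fromstring(intermediate_xml.strip())
    record = ET.Element("record")
    for name, path in rules.items():
        # Every node matched by a rule becomes one element named after the rule.
        for node in root.findall(path):
            field = ET.SubElement(record, name)
            field.text = (node.text or "").strip()
    return record


if __name__ == "__main__":
    result = extract(INTERMEDIATE, RULES)
    print(ET.tostring(result, encoding="unicode"))
```

Keeping the rules in a table separate from the code that applies them mirrors the separation the abstract describes between rule generation and automatic extraction: in this sketch, only `RULES` would need to change for a different document layout.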

Keywords/Search Tags:

Information Extraction, XML, PDF, HTML

Related items

1	Research On The Technology Of The Web Employment Information Extraction Based On The HTML
2	Based On The Html Pages Of Web Information Extraction
3	Research On The HTML And PDF Information Extraction Technology Based XML
4	The Technology Of Web Information Extraction Based On HTML Parser
5	Data Extraction And Integration In HTML Tables
6	ClusTex: Using clustering techniques for information extraction from HTML pages containing semi-structured data
7	Study On Tables Information Extraction Based On Web
8	The Research On Web Information Extraction Based On HMM
9	Semi-structured Web Information Extraction Technology And Its Application
10	Research On Web Information Extraction Tool