The Research Of XML-Based Web Information Extraction

Posted on:2006-05-28

Degree:Master

Type:Thesis

Country:China

Candidate:R Lu

GTID:2168360155464893

Subject:Management Science and Engineering

Abstract/Summary:

With the explosive growth of the Web, finding the specific piece of information a user wants has become a serious problem, which makes information extraction from web pages necessary. A wrapper is a program that performs this extraction. The key task in building an extraction system is constructing an accurate, robust, and adaptable wrapper without much human intervention. A wrapper should be independent of any particular web site and should tolerate changes to web pages.

Many approaches to wrapper generation have been proposed, but each has limitations that keep the resulting wrappers from being accurate, robust, or general.

This thesis develops an XML-based web information extraction system. The central problem of information extraction is how to generate accurate, general, and robust extraction rules. To solve it, the thesis applies standard XSLT and XPath, exploiting their capabilities for locating and transforming data. Using an example-based learning algorithm, the system locates the information blocks of interest, identifies the target information accurately, and generates extraction rules expressed in XSLT. Because the rules are XSLT, they are easy to understand and revise.

Extraction rules fail mainly because their XPath expressions fail. The thesis therefore studies methods for optimizing extraction rules and proposes several improved location methods, together with a strategy for combining them to produce simple, robust, and general extraction rules. These methods have been applied to information extraction to obtain better precision.
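The claim that extraction rules fail mainly through their XPath expressions can be illustrated with a minimal sketch. It is not the thesis's system: the sample pages, the two rules, and the helper names `extract` and `extract_with_fallback` are all hypothetical, and Python's standard-library `xml.etree.ElementTree` (which supports only a subset of XPath and no XSLT) stands in for a full XSLT processor. The sketch compares a purely positional path with an attribute-anchored one after a page change, then combines them in a simple fallback order.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already converted to well-formed XML.
PAGE_V1 = """<html><body>
  <div class="nav">Home</div>
  <div class="item"><span class="title">XML Basics</span><span class="price">12.50</span></div>
  <div class="item"><span class="title">XSLT in Practice</span><span class="price">20.00</span></div>
</body></html>"""

# The same page after a layout change: a banner is inserted before the navigation.
PAGE_V2 = PAGE_V1.replace(
    '<div class="nav">', '<div class="banner">Sale</div><div class="nav">'
)

# Positional rule: depends on the exact position of the target block.
POSITIONAL_RULE = "./body/div[2]/span[1]"
# Attribute-anchored rule: depends on attributes of the block, not its position.
ATTRIBUTE_RULE = ".//div[@class='item']/span[@class='title']"


def extract(page, rule):
    """Return the text of every node in the page matched by the XPath-like rule."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(rule)]


def extract_with_fallback(page, rules):
    """Try each rule in order and return the first non-empty result.

    This is an assumed, simplified stand-in for a combination strategy.
    """
    for rule in rules:
        result = extract(page, rule)
        if result:
            return result
    return []


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, "positional:", extract(page, POSITIONAL_RULE))
        print(name, "attribute: ", extract(page, ATTRIBUTE_RULE))
    print("fallback on v2:", extract_with_fallback(PAGE_V2, [POSITIONAL_RULE, ATTRIBUTE_RULE]))
```

In this sketch the positional rule matches one title on the first page and nothing after the banner is added, while the attribute-anchored rule returns both titles on either version. A real system would express such rules as XSLT templates and choose among location methods with more care than a first-match fallback.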

Keywords/Search Tags:

Information Extraction, XML, XSLT

Related items

1	Web Information Extraction Based On Principle Part Extraction
2	Research On Web Information Extraction Techniques
3	Study On Information Extraction And The Index Of Topic Search Engine
4	Semi-structured In The Xml-based Web Information Extraction
5	Design And Implementation Of Web Information Extraction Based On DOM
6	Design And Implementation Of Web Information Extraction Based On Dom
7	Research Of Web Information Extraction Based On XML
8	Automatic Extraction Of Information From Web Pages
9	The Research Of XML-Based Web Information Extraction
10	Based On The Xml Deep Web Information Extraction System With The Initial Implementation,