The Research Of XML-Based Web Information Extraction

Posted on: 2006-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: R Lu
Full Text: PDF
GTID: 2168360155464893
Subject: Management Science and Engineering
Abstract/Summary:
With the explosive growth of the Web, finding the specific piece of information a user wants has become a serious problem, so information extraction from web pages is necessary. A wrapper is a program that performs this extraction. The key task in building an extraction system is constructing an accurate, robust, and adaptable wrapper without much human intervention. A wrapper should be independent of any particular web site and should withstand changes to web pages.

Many approaches to wrapper generation have been proposed, but each has limitations that keep the resulting wrappers from being accurate, robust, or general.

This paper develops an XML-based web information extraction system. The key problem of information extraction is how to generate accurate, general, and robust extraction rules. This paper applies standard XSLT and XPath, exploiting their capabilities for data location and conversion, to solve this problem. Using an example-based learning algorithm, it locates the information blocks of interest, identifies the information accurately, and generates XSLT-based extraction rules. Because the extraction rules are written in XSLT, they can be easily understood and revised.

Extraction rules fail mainly because their XPath expressions fail. This paper studies methods for optimizing extraction rules and puts forward several improved location methods, together with a strategy for combining them to generate simple, robust, and general extraction rules. These methods have been applied to information extraction and achieve better precision.
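
To make the point about XPath failure concrete, here is a minimal sketch of why a purely positional location path tends to break when a page changes, while a path anchored on an attribute may survive. It is an illustration only, not the method proposed in the thesis: the two sample pages, the `class` values, the rule strings, and the `extract` helper are all hypothetical, and it uses the limited XPath subset in Python's standard `xml.etree.ElementTree` rather than a full XSLT processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page, and the same page after a block is inserted on top.
PAGE_V1 = """<html><body>
<div><span class="title">Widget</span></div>
<div><span class="price">9.99</span></div>
</body></html>"""

PAGE_V2 = """<html><body>
<div><span class="ad">Sale!</span></div>
<div><span class="title">Widget</span></div>
<div><span class="price">9.99</span></div>
</body></html>"""

# Two candidate location paths for the price (both are assumptions for this sketch).
POSITIONAL_RULE = "./body/div[2]/span"      # depends on the block's position
ANCHORED_RULE = ".//span[@class='price']"   # depends on an attribute value


def extract(page, rule):
    """Return the text of the first node matched by rule, or None if nothing matches."""
    root = ET.fromstring(page)
    node = root.find(rule)
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, extract(page, POSITIONAL_RULE), extract(page, ANCHORED_RULE))
```

On the changed page the positional rule silently returns the wrong field, whereas the anchored rule still finds the price. This is the kind of failure that more robust location methods aim to reduce.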
Keywords/Search Tags: Information Extraction, XML, XSLT