XML-Based Semi-Structured Web Information Extraction

Posted on: 2007-11-23
Degree: Master
Type: Thesis
Country: China
Candidate: Q D Gou
Full Text: PDF
GTID: 2208360185456109
Subject: Software engineering
Abstract/Summary:
With the explosive growth of the Web, finding the specific piece of information a user wants has become a serious problem, so extracting information from web pages is necessary. A program that performs this task is called a wrapper. The key requirements are that a wrapper can be constructed rapidly and without much human intervention, that it is robust and adapts to changes in the web page, and that it is as general as possible, that is, independent of any particular web site.

Many approaches to wrapper generation have been proposed, but each has limitations that make it hard to be accurate, robust, or general. This thesis studies and analyzes those approaches. It then applies standard XML technologies to the web extraction problem and develops an XML-based platform for semi-structured web information extraction. An inductive learning algorithm is used to locate and identify the information blocks of interest. Standard XSLT and XPath are used, exploiting their power for data location and transformation, to solve the key problem: writing extraction rules.

Finally, the thesis studies the optimization of extraction rules and compares several information location methods. The aim is to generate simple, robust, and general extraction rules.
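
As a rough illustration of what an XPath-based extraction rule can look like, the sketch below applies a small rule set to a page that has already been converted to well-formed XML. It is not the thesis's platform: the sample page, the class names, and the names `RECORD_PATH`, `FIELD_PATHS`, and `extract` are all hypothetical, and it uses the limited XPath subset of Python's standard `xml.etree.ElementTree` rather than a full XSLT processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, assumed to be already cleaned into well-formed XML.
PAGE = """
<html>
  <body>
    <div class="nav">Home | Products</div>
    <table class="products">
      <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
      <tr><td class="name">Gadget</td><td class="price">19.50</td></tr>
    </table>
  </body>
</html>
"""

# Hypothetical extraction rules: one path locates the records (the information
# block), and one path per field is evaluated relative to each record node.
RECORD_PATH = ".//table[@class='products']/tr"
FIELD_PATHS = {
    "name": "td[@class='name']",
    "price": "td[@class='price']",
}


def extract(xml_text):
    """Apply the rules to an XML document and return a list of records."""
    root = ET.fromstring(xml_text)
    records = []
    for row in root.findall(RECORD_PATH):
        record = {}
        for field, path in FIELD_PATHS.items():
            # findtext returns the default when the path matches nothing
            record[field] = row.findtext(path, default="").strip()
        records.append(record)
    return records


if __name__ == "__main__":
    for rec in extract(PAGE):
        print(rec)
```

The rules here select nodes by attribute value rather than by absolute position in the document, which is one plausible way to keep a rule working when unrelated parts of the page layout change.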
Keywords/Search Tags: Information Extraction, XML, XSLT, XPath