Semi-structured In The Xml-based Web Information Extraction

Posted on:2007-11-23

Degree:Master

Type:Thesis

Country:China

Candidate:Q D Gou

Full Text:PDF

GTID:2208360185456109

Subject:Software engineering

Abstract/Summary:

With the explosive growth of the Web, finding the specific piece of information a user wants has become a serious problem, so information extraction from web pages is necessary. A program that performs this task is called a wrapper. The key requirements are that a wrapper can be constructed rapidly and without much human intervention; that it is robust, adapting to changes in a web page; and that it is as general as possible, that is, independent of any particular web site.

Many approaches to wrapper generation have been proposed, but each has limitations that make it hard for the resulting wrapper to be accurate, robust, or general. This dissertation studies and analyzes those approaches.

It applies standard XML technologies to the web extraction problem and develops an XML-based platform for semi-structured web information extraction. An inductive learning algorithm is used to locate and identify the desired information blocks. Standard XSLT and XPath, with their capabilities for data location and transformation, are used to solve the key problem: writing extraction rules.

Finally, the dissertation studies the optimization of extraction rules and compares several information location methods, with the aim of generating simple, robust, and general extraction rules.
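
The idea of expressing an extraction rule as a location path can be illustrated with a minimal sketch. This is not the thesis's implementation: the page content, the `RULE` structure, and the `extract` function are all hypothetical, and the sketch uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and no XSLT. It assumes the page has already been cleaned into well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already converted to well-formed XML.
PAGE = """
<html>
  <body>
    <div class="nav">Home | Books</div>
    <table class="results">
      <tr><td class="title">Learning XML</td><td class="price">29.95</td></tr>
      <tr><td class="title">XSLT Basics</td><td class="price">35.00</td></tr>
    </table>
  </body>
</html>
"""

# Hypothetical extraction rule: one path that locates each record block,
# plus one relative path per field inside a block.
RULE = {
    "record": ".//table[@class='results']/tr",
    "fields": {
        "title": "td[@class='title']",
        "price": "td[@class='price']",
    },
}


def extract(xml_text, rule):
    """Apply a path-based extraction rule and return a list of records."""
    root = ET.fromstring(xml_text)
    records = []
    for block in root.findall(rule["record"]):
        record = {}
        for name, path in rule["fields"].items():
            node = block.find(path)
            # A missing field yields None instead of raising an error.
            has_text = node is not None and node.text
            record[name] = node.text.strip() if has_text else None
        records.append(record)
    return records


if __name__ == "__main__":
    for rec in extract(PAGE, RULE):
        print(rec)
```

The paths in this sketch select nodes by attribute rather than by absolute position, so inserting an unrelated element such as another navigation block would not break the rule. That is one plausible reading of what a "robust" rule means here; the abstract does not say which location methods the thesis actually compares.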

Keywords/Search Tags:

Information Extraction, XML, XSLT, XPath

Related items

1	Semi-structured In The Xml-based Web Information Extraction
2	Research Of Web Information Extraction Based On XML
3	The Application Of Xpath And XSLT In Querying XML Document
4	Research And Design, Based On Xml And Xslt, Web Information Extraction
5	Research On The Semantics Of XML Family Languages
6	Web Information Extraction Based On Principle Part Extraction
7	Research And Implementation Of Web Information Extraction Based On XML
8	Smart Client And Office System Integration Of Applied Research
9	Study On Information Extraction And The Index Of Topic Search Engine
10	Design And Implementation Of Accurate Web Information Extraction System