Study On Information Extraction And The Index Of Topic Search Engine

Posted on: 2008-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: M Yu
Full Text: PDF
GTID: 2178360242471636
Subject: Computer software and theory
Abstract/Summary:
With the explosion of the World Wide Web, "information overload" has become a serious problem. To help people get exactly the information they want from the Web, information extraction from web pages is necessary. A program that performs this task is called a wrapper. The key requirements are that a wrapper can be constructed rapidly and without much human intervention, that it is robust and adapts to changes in a web page, and that it is as general as possible, that is, independent of any particular web site.

Many approaches have been proposed to ease wrapper generation. Almost all of them use proprietary extraction languages. These languages are simple, which makes it hard to express accurate or complex extraction patterns. Although extraction rules can be induced automatically from labeled examples, the resulting rules are not accurate, robust, or general.

We apply standard XML technologies to the web information extraction problem. With standard XSLT, we can exploit the strong and flexible features of the language to construct simple, robust, and general extraction rules. We have developed a platform to ease wrapper construction.

The failure of extraction rules is mainly due to the failure of XPath expressions. This paper studies methods for optimizing extraction rules and puts forward several improved location methods. Moreover, a strategy for combining these methods is put forward to generate simple extraction rules. These methods have been applied to information extraction and achieve better precision.
Keywords/Search Tags: XSLT, Information Extraction, XML
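
The abstract attributes most extraction failures to XPath expressions that stop matching when a page changes. The sketch below illustrates that general idea only; it is not the thesis's location methods or its XSLT platform. It assumes well-formed XHTML input and uses the limited XPath subset in Python's standard `xml.etree.ElementTree`. The page snippets, the `price` class name, and the `extract` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before a redesign.
PAGE_V1 = """<html><body>
<div><span class="price">19.90</span></div>
</body></html>"""

# Hypothetical page after a redesign that inserts a banner <div> first.
PAGE_V2 = """<html><body>
<div>Banner</div>
<div><span class="price">19.90</span></div>
</body></html>"""

# Positional rule: depends on the order of sibling elements.
POSITIONAL = "./body/div[1]/span"
# Attribute-anchored rule: depends only on an attribute assumed to be stable.
ANCHORED = ".//span[@class='price']"


def extract(page, path):
    """Return the text of the first node matching path, or None if no match."""
    root = ET.fromstring(page)
    node = root.find(path)
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, "positional:", extract(page, POSITIONAL))
        print(name, "anchored:  ", extract(page, ANCHORED))
```

In this sketch the positional rule matches the first version of the page but returns no node once the banner is inserted, while the attribute-anchored rule matches both versions. Real pages are rarely well-formed XML, so a practical wrapper would first need an HTML-to-XHTML cleanup step, which is omitted here.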