
Web Information Extraction Rules And Their Learning Algorithms

Posted on: 2009-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: Q Shi
Full Text: PDF
GTID: 2178360248954848
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of Web information, Information Extraction (IE) techniques make it possible to automatically extract data of interest from large collections of Web documents. The extracted information is usually stored in a database as structured data, which is then used for information query, text mining, Web data analysis and automatic question answering. One of the main elements of an information extraction system is its set of extraction rules or patterns, which define how to locate and identify the data of interest across multiple Web sources. Manual definition of extraction rules, as required by early systems, is expensive and time-consuming. Recent efforts have therefore focused on efficient rule generation.

In this thesis, we focus on extraction rules and their automatic generation through inductive learning algorithms. We start with a comparison of existing IE systems and a formal definition of extraction rules. Next, we propose a pre-processing algorithm that generates candidates by splitting training instances, and present an improved learning algorithm based on WHISK. Having implemented both the WHISK algorithm and its variant, we conclude with a detailed experimental analysis on real Web content.

In contrast with WHISK, our system can perform automatic extraction from real on-line Web pages by defining retrieval rules, extraction rules and mapping rules. Extraction rules are generated in two phases: specifying the extraction point as a path in a DOM tree, and inductively learning a regular expression from hand-tagged training examples.

A comparison with the experimental results from WHISK shows that our algorithm generates extraction rules more efficiently than the WHISK algorithm, and that our system performs well on both single-slot and multi-slot extraction tasks.
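The two-phase rule described above (a DOM path that locates the extraction point, followed by a regular expression applied to the text found there) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample page, the path `EXTRACTION_PATH`, the slot names, and the hand-written pattern `RULE` are hypothetical, and the pattern is written by hand rather than learned from tagged examples. It does not reproduce the rule syntax or the learning algorithm of the thesis.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. Real Web pages are rarely
# well-formed XML and would need a tolerant HTML parser instead.
PAGE = """
<html><body>
  <div id="nav">Home</div>
  <div id="listings">
    <p>Capitol Hill - 1 br twnhme, $675</p>
    <p>Ballard - 3 br house, $995</p>
  </div>
</body></html>
"""

# Phase 1 (assumed form): the extraction point as a path in the DOM tree.
EXTRACTION_PATH = "./body/div[@id='listings']/p"

# Phase 2 (assumed form): a multi-slot regular expression. Here it is
# hand-written; in the thesis such a pattern is learned inductively.
RULE = re.compile(r"^(?P<area>.+?) - (?P<bedrooms>\d+) br .*?\$(?P<price>\d+)")


def extract(page, path=EXTRACTION_PATH, rule=RULE):
    """Return one dict of slot values per DOM node matched by the rule."""
    root = ET.fromstring(page.strip())
    records = []
    for node in root.findall(path):      # locate the extraction points
        match = rule.search(node.text or "")
        if match:                        # keep only nodes the pattern covers
            records.append(match.groupdict())
    return records


if __name__ == "__main__":
    for record in extract(PAGE):
        print(record)
```

Restricting the pattern to the nodes selected by the path keeps navigation text such as the `nav` block out of the match, which is one plausible reason to separate the two phases.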
Keywords/Search Tags: Information Extraction, Extraction Rule, DOM, Learning Algorithm