Web Information Extraction Rules And Their Learning Algorithms

Posted on:2009-07-18

Degree:Master

Type:Thesis

Country:China

Candidate:Q Shi

GTID:2178360248954848

Subject:Computer software and theory

Abstract/Summary:

With the rapid growth of Web information, Information Extraction (IE) techniques are well suited to automatically extracting data of interest from large collections of Web documents. The extracted information is usually stored in a database as structured data, which is then used for information querying, text mining, Web data analysis, and automatic question answering. One of the main elements of an information extraction system is its set of extraction rules or patterns, which in general define how to locate and identify the data of interest across multiple Web sources. Manual definition of extraction rules, required by early systems, is expensive and time-consuming. Recent efforts have therefore focused on efficient ways of generating rules.

In this thesis, we work mainly on extraction rules and their automatic generation through inductive learning algorithms. We start with a comparison of existing IE systems and a formal definition of extraction rules. Next, we propose a pre-processing algorithm that generates candidates by splitting training instances, and we provide an improved learning algorithm based on WHISK. Having implemented both the WHISK algorithm and its variant, we end with a detailed experimental analysis over real Web content.

In contrast with WHISK, our system can perform automatic extraction from real on-line Web pages by defining retrieval rules, extraction rules, and mapping rules. Extraction rules are generated in two phases: specifying the extraction point as a path in a DOM tree, and inductively learning a regular expression from hand-tagged training examples.

A comparison with the experimental results from WHISK shows that our algorithm generates extraction rules more efficiently than the WHISK algorithm, and that our system performs well on both single-slot and multi-slot extraction tasks.
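The two-phase form of an extraction rule described in the abstract (a DOM path that fixes the extraction point, followed by a regular expression that fills the slots) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the page fragment, the path, the slot names, and the `extract` helper are all hypothetical, and the regular expression is written by hand here, whereas the thesis learns it from hand-tagged training examples.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. Real Web pages are often not
# well-formed and would need a tolerant HTML parser instead.
PAGE = """
<html><body>
  <div id="listings">
    <p>Capitol Hill - 1 br twnhme, $675</p>
    <p>Ballard - 3 br house, $995</p>
  </div>
</body></html>
"""

# Phase 1 (assumed form): the extraction point as a path in the DOM tree.
EXTRACTION_PATH = "./body/div/p"

# Phase 2 (assumed form): a regular expression with one named group per slot.
# Written by hand for illustration; the thesis induces it from examples.
SLOT_PATTERN = re.compile(
    r"(?P<area>[^-]+?) - (?P<bedrooms>\d+) br .*?\$(?P<price>\d+)"
)


def extract(page, path, pattern):
    """Apply a two-phase rule: select nodes by path, then fill slots by regex."""
    root = ET.fromstring(page.strip())
    records = []
    for node in root.findall(path):
        text = "".join(node.itertext()).strip()
        match = pattern.search(text)
        if match:
            # One record per matching node; each named group is one slot.
            records.append(match.groupdict())
    return records


if __name__ == "__main__":
    for record in extract(PAGE, EXTRACTION_PATH, SLOT_PATTERN):
        print(record)
```

A pattern with a single named group corresponds to a single-slot task, while the three groups above make this a multi-slot rule. Mapping rules, which the abstract mentions but does not detail, would then relate slot names to database fields.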

Keywords/Search Tags:

Information Extraction, Extraction Rule, DOM, Learning Algorithm

Related items

1	Web Information Extraction Rules And Their Learning Algorithms
2	The Study Of Rule Induction For Automatic WEB Data Extraction
3	Semi-structured Web Information Extraction Technology And Its Application
4	Web Text Of The Rule-based Information Extraction Technology Research
5	Study On Text Preprocessing And Automatic Rule Learning Technology For Information Extraction
6	The Key Technology Of RFC's Rule Extraction Based On NLP
7	Research On Web Information Extraction Techniques
8	Research On Web Product Indicator Extraction Based On Ontology
9	The Study Of Semi-supervised Web Data Extraction Rule Induction Based On User Interaction
10	Research On Network Information Extraction And Visualization Technology Of Incident