Algorithm Research For Text Information Extraction Based On Wrapper Model

Posted on:2007-04-11

Degree:Master

Type:Thesis

Country:China

Candidate:J P Wang

Full Text:PDF

GTID:2178360185965358

Subject:Computer application technology

Abstract/Summary:

With the development of Internet technology, the amount of information on the Internet is growing exponentially, and an important line of research asks how to process this large volume of online documents automatically. Text information extraction is a natural language processing task that automatically extracts specific types of information, such as events and facts, from text, turns them into structured data, and populates database slots for later queries. It is an important method for processing large quantities of text. This thesis studies algorithms for text information extraction based on the wrapper model.

The thesis first summarizes three common text information extraction models, compares their strengths and weaknesses, and focuses on the wrapper-based model. After analyzing typical wrapper induction algorithms for text information extraction, it proposes a new induction algorithm that exploits important features of the pages, such as annotations and text patterns. The new algorithm adds the annotations to the state set of the rule's finite state machine, which reduces the time spent on search and locates the target information accurately. The learned text patterns are then used to filter out irrelevant extracted information. The experimental results show that the new algorithm achieves higher precision and recall.

Web pages are highly dynamic and evolve continually, so their structures change frequently, and wrappers may stop working when these changes occur. To solve this problem, the thesis proposes a novel approach to automatic wrapper maintenance. It is based on the observation that, despite various page changes, many important features of the pages are preserved. These preserved features are used to identify the locations of the desired values in the changed pages and to repair the wrappers accordingly. The experimental results show that the new algorithm adapts to most changes in Web pages, can automatically induce new wrappers for new pages, and also achieves higher precision and recall.

Considering the high cost of manually labeling the training data during the experiments, the thesis applies several active learning algorithms to the new induction algorithm described above, and uses active selection to choose the most valuable examples for users to label,...
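
As a rough illustration of the wrapper idea summarized above, and not the thesis's actual algorithm, the following Python sketch applies a landmark-based extraction rule and then filters candidates with a text pattern. The `Rule` class, the `extract` function, the use of HTML comments as "annotations", and the sample page are all hypothetical names and assumptions introduced here for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """Hypothetical extraction rule (an assumption, not the thesis's format)."""
    start_landmarks: List[str]  # consumed in order, like states of a small FSM
    end_landmark: str           # marks the end of the target value
    pattern: str                # regex that a valid value must fully match


def extract(page: str, rule: Rule) -> List[str]:
    """Scan the page repeatedly, returning every value the rule accepts."""
    values = []
    pos = 0
    while True:
        # Skip to each start landmark in turn; stop when one is missing.
        for landmark in rule.start_landmarks:
            idx = page.find(landmark, pos)
            if idx == -1:
                return values
            pos = idx + len(landmark)
        end = page.find(rule.end_landmark, pos)
        if end == -1:
            return values
        candidate = page[pos:end].strip()
        # Text-pattern filter: drop candidates that do not look like the target.
        if re.fullmatch(rule.pattern, candidate):
            values.append(candidate)
        pos = end + len(rule.end_landmark)


# Assumed sample page: an HTML comment serves as the annotation landmark.
page = (
    "<!-- price --><b>$12.50</b>"
    "<!-- price --><b>call us</b>"
    "<!-- price --><b>$8.00</b>"
)
rule = Rule(["<!-- price -->", "<b>"], "</b>", r"\$\d+\.\d{2}")
print(extract(page, rule))  # ['$12.50', '$8.00']
```

In this sketch the annotation landmark lets the scan jump straight to the region of interest, and the pattern discards the "call us" candidate that the landmarks alone would have accepted.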

Keywords/Search Tags:

Text Information Extraction, Wrapper Induction, Wrapper Maintenance, Active Learning

Related items

1	The Study Of Semi-supervised Web Data Extraction Rule Induction Based On User Interaction
2	Research And Implementation On Chinese Web Pages-Oriented Information Extraction Technologies
3	Researches On Models And Algorithms Of Text Information Extraction
4	Scalable Detection and Extraction of Data in Lists in OCRed Text for Ontology Population Using Semi-Supervised and Unsupervised Active Wrapper Induction
5	A Domain Knowledge-based Personalized Comparison Shopping System: Design And Implementation
6	Study And Design Of Text Information Extraction And Classification System
7	Learning to adapt information extraction knowledge across multiple Web sites
8	Research For Information Extraction Based On Wrapper Model Algorithm
9	Web Page Attribute Extraction Method Research
10	Research On Wrapper Adaptation In Web Data Integration