
Algorithm Research For Text Information Extraction Based On Wrapper Model

Posted on: 2007-04-11 | Degree: Master | Type: Thesis
Country: China | Candidate: J P Wang | Full Text: PDF
GTID: 2178360185965358 | Subject: Computer application technology
Abstract/Summary:
With the development of Internet technology, the amount of information on the Internet is growing exponentially, and an important line of research concerns how to process this large volume of online documents automatically. Text information extraction is a natural language processing task that automatically extracts specific types of information from text, such as events and facts, forms structured data from them, and populates database slots for later queries. It is an important method for processing large quantities of text. This thesis studies algorithms for text information extraction based on the wrapper model.

The thesis first summarizes three common text information extraction models, compares their strengths and weaknesses, and focuses on the wrapper-based model. After analyzing typical wrapper induction algorithms, it proposes an inductive algorithm that exploits important page features such as annotations and text patterns. The new algorithm adds the annotations to the state set of the rule's finite state machine, which reduces the time spent on search and locates the target information accurately; the learned text patterns are used to filter out irrelevant extracted information. The experimental results show that the new algorithm has higher precision and recall.

Web pages are highly dynamic and continually evolving, so their structure changes frequently and wrappers may stop working as a result. To address this problem, the thesis proposes a novel approach to automatic wrapper maintenance. It is based on the observation that, despite various page changes, many important features of the pages are preserved. These preserved features are used to identify the locations of the desired values in the changed pages and to repair the wrappers accordingly. The experimental results show that the new algorithm adapts to most changes in Web pages, can automatically create a new inductive algorithm for new pages, and also has higher precision and recall.

Considering the high cost of manually labeling the training data during the experiments, the thesis applies several active learning algorithms to the inductive algorithm described above and uses active learning to select the most valuable examples for users to label,...
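
As an illustration only, the two ideas named in the abstract (a wrapper that locates target values, and a text pattern that filters out unrelated extractions) can be sketched as follows. This is not the thesis's algorithm: it uses a simple left/right delimiter wrapper, and the function names, the sample page, and the price pattern are all hypothetical choices made for the example.

```python
import re


def extract_with_wrapper(page, left, right):
    """Return every substring of `page` enclosed by the `left` and `right` delimiters.

    Hypothetical helper: a minimal delimiter-based wrapper, not the thesis's
    finite-state-machine rules.
    """
    values = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end].strip())
        pos = end + len(right)
    return values


def filter_by_pattern(values, pattern):
    """Keep only the extracted values that fully match the text pattern."""
    compiled = re.compile(pattern)
    return [v for v in values if compiled.fullmatch(v)]


# Made-up sample page: three items, one of which does not contain a price.
page = (
    "<li><b>Price:</b> <i>12.50</i></li>"
    "<li><b>Price:</b> <i>call us</i></li>"
    "<li><b>Price:</b> <i>7.00</i></li>"
)

candidates = extract_with_wrapper(page, "<i>", "</i>")
prices = filter_by_pattern(candidates, r"\d+\.\d{2}")
print(candidates)
print(prices)
```

In this sketch the delimiters are given by hand; in wrapper induction they would be learned from labeled example pages, and the text pattern would likewise be learned rather than written as a fixed regular expression.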
Keywords/Search Tags: Text Information Extraction, Wrapper Induction, Wrapper Maintenance, Active Learning