Learning to adapt information extraction knowledge across multiple Web sites

Posted on:2008-12-19

Degree:Ph.D

Type:Thesis

University:The Chinese University of Hong Kong (Hong Kong)

Candidate:Wong, Tak Lam

GTID:2448390005957544

Subject:Computer Science

Abstract/Summary:

One problem with most existing Web information extraction methods is that the extraction knowledge learned from a Web site can only be applied to Web pages from the same site. This thesis first investigates the problem of wrapper adaptation, which aims at adapting a wrapper previously learned from a source site to new unseen sites. A dependence model that captures the dependence between text fragments in Web pages is developed. Under this model, two types of text-related features are identified. The first type is called site invariant features. These features are likely to remain unchanged in Web pages from different sites in the same domain. The second type is called site dependent features. These features differ in Web pages collected from different Web sites, while they are similar in Web pages originating from the same site. Based on this model, two frameworks are developed to solve the wrapper adaptation problem. The first framework is called Information Extraction Knowledge Adaptation using Machine Learning approach (IEKA-ML). Machine learning methods are employed to derive site invariant features from the previously learned extraction knowledge and from items previously collected or extracted from the source Web site. Both site dependent features and site invariant features in new sites are considered when learning new information extraction knowledge tailored to the new unseen site.

The second framework, called Information Extraction Knowledge Adaptation using Bayesian learning approach (IEKA-BAYES), solves the problem of wrapper adaptation as well as the issue of new attribute discovery. The new attribute discovery problem aims at extracting new or previously unseen attributes that are not specified in the wrapper. To handle the uncertainty, a probabilistic generative model for the generation of text fragments and layout format related to attributes in Web pages is designed. Bayesian learning and expectation-maximization (EM) techniques are developed under the proposed generative model to accomplish the wrapper adaptation task. Previously unseen attributes together with their semantic labels can be discovered via another EM-based Bayesian learning on the generative model. Extensive experiments on over 30 real-world Web sites in three different domains, along with comparisons with existing works, have been conducted to evaluate the IEKA-ML and IEKA-BAYES frameworks.

An extension of wrapper adaptation is developed to collectively extract information from multiple Web pages. There exists mutual influence between text fragments of different Web pages, and hence they should be considered collectively during extraction. Extending the dependence model, a framework is developed that considers both the dependence between text fragments within a single Web page and the dependence between text fragments from different pages. One characteristic of this model is that additional information can be incorporated into the model and multiple tasks can be tackled simultaneously. As a result, a global solution can be obtained that optimizes the quality of the tasks and, at the same time, eliminates the conflict between them. Experiments on product feature extraction and hot item mining from multiple auction Web sites have been conducted to demonstrate the effectiveness of this framework.

Keywords/Search Tags:

Web, Site, Extraction, Multiple, Dependence between text fragments, Wrapper adaptation, Model, Problem

Related items

1	Algorithm Research For Text Information Extraction Based On Wrapper Model
2	Researches On Models And Algorithms Of Text Information Extraction
3	Research On Wrapper Adaptation In Web Data Integration
4	Research For Information Extraction Based On Wrapper Model Algorithm
5	The Filter-wrapper MRMR-based K-dependence Bayesian Network Classifier
6	Research And Implementation Of Page Object Extraction Model For Vertical Search Engine
7	Research And Implementation On Chinese Web Pages-Oriented Information Extraction Technologies
8	Study And Design Of Text Information Extraction And Classification System
9	Web Page Attribute Extraction Method Research
10	Research On Adaptive Wrapper In Deep Web Data Extraction