Learning to adapt information extraction knowledge across multiple Web sites

Posted on: 2008-12-19
Degree: Ph.D
Type: Thesis
University: The Chinese University of Hong Kong (Hong Kong)
Candidate: Wong, Tak Lam
GTID: 2448390005957544
Subject: Computer Science
Abstract/Summary:
One problem with most existing Web information extraction methods is that the extraction knowledge learned from one Web site can only be applied to Web pages from the same site. This thesis first investigates the problem of wrapper adaptation, which aims at adapting a wrapper previously learned from a source site to new, unseen sites. A dependence model that captures the dependence between text fragments in Web pages is developed. Under this model, two types of text-related features are identified. The first type, site-invariant features, is likely to remain unchanged across Web pages from different sites in the same domain. The second type, site-dependent features, differs between Web pages collected from different Web sites but is similar among Web pages originating from the same site. Based on this model, two frameworks are developed to solve the wrapper adaptation problem. The first framework is called Information Extraction Knowledge Adaptation using Machine Learning approach (IEKA-ML). Machine learning methods are employed to derive site-invariant features from the previously learned extraction knowledge and from items previously collected or extracted from the source Web site. Both site-dependent and site-invariant features in the new site are then considered when learning new information extraction knowledge tailored to that site.

The second framework, called Information Extraction Knowledge Adaptation using Bayesian learning approach (IEKA-BAYES), addresses the problem of wrapper adaptation as well as the issue of new attribute discovery. New attribute discovery aims at extracting new or previously unseen attributes that are not specified in the wrapper. To handle the uncertainty, a probabilistic generative model for the generation of text fragments and layout format related to attributes in Web pages is designed. Bayesian learning and expectation-maximization (EM) techniques are developed under the proposed generative model to accomplish the wrapper adaptation task. Previously unseen attributes, together with their semantic labels, can be discovered via another EM-based Bayesian learning procedure on the generative model. Extensive experiments on over 30 real-world Web sites in three different domains, along with comparisons against existing work, have been conducted to evaluate the IEKA-ML and IEKA-BAYES frameworks.

An extension of wrapper adaptation is developed to collectively extract information from multiple Web pages. Text fragments on different Web pages influence one another, and hence they should be considered collectively during extraction. Extending the dependence model, a framework is developed that considers both the dependence between text fragments within a single Web page and the dependence between text fragments from different pages. One characteristic of this model is that additional information can be incorporated into it and multiple tasks can be tackled simultaneously. As a result, a global solution can be obtained that optimizes the quality of the tasks while eliminating the conflicts between them. Experiments on product feature extraction and hot item mining from multiple auction Web sites have been conducted to demonstrate the effectiveness of this framework.
Keywords/Search Tags:Web, Site, Extraction, Multiple, Dependence between text fragments, Wrapper adaptation, Model, Problem