
The Analysis And Application Of Genetic Algorithm In Synchronized Web Data Extraction

Posted on: 2010-10-29
Degree: Master
Type: Thesis
Country: China
Candidate: L R Wan
Full Text: PDF
GTID: 2178360278972609
Subject: Computer software and theory
Abstract/Summary:
The deep web presents a pressing need to integrate large numbers of dynamically evolving data sources. To make the construction of an integration system more automatic, we identify three problems. First, across the sequential tasks of the spider, how can extraction take the peer sources into account to facilitate the subsequent matching task? Second, across parallel sources, how can a wrapper leverage peer wrappers or domain rules to enhance extraction accuracy? Third, how can the extraction algorithm itself be improved to raise both extraction accuracy and efficiency?

These issues, while seemingly unrelated, all boil down to a lack of "context awareness". Current automatic wrapper induction approaches generate a wrapper for one source at a time, in isolation, and thus inherently lack awareness of the peer sources or domain knowledge in the context of integration.

In this thesis, we propose the concept of context-aware wrappers, which are amenable to matching and can leverage peer wrappers or prior domain knowledge. Such context awareness inspires a synchronization framework that constructs wrappers consistently and collaboratively across their mutual context. Drawing insight from turbo codes, we apply the genetic algorithm to develop a turbo syncer that interconnects extraction with matching; together they achieve context awareness in wrapping.

The main work and achievements of this thesis are:
1. We discuss synchronized data extraction in the deep web and propose the concept of context-aware wrappers.
2. We apply the genetic algorithm to develop a turbo syncer that interconnects extraction with matching, which together achieve context awareness in wrapping.
3. We leverage peer sources, peer wrappers, and domain rules to enhance extraction accuracy.

The contribution of this thesis is its treatment of how to realize Context-Aware Wrapping. We consider the peer sources to facilitate the matching task, and we enhance a wrapper's extraction accuracy by leveraging peer wrappers or domain rules. First, we introduce the concept of Context-Aware Wrapping. Then, to realize it, we propose a Spiral-Decoding Method that synchronizes the extractions by spiral decoding. Finally, we apply the genetic algorithm to develop a turbo syncer that realizes it.
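
The abstract does not specify how the genetic algorithm encodes wrappers or scores them, so the following is only a minimal sketch under stated assumptions, not the thesis's turbo syncer. It assumes each source already has a few candidate extractions (sets of attribute labels), that an individual picks one candidate per source, and that fitness is the pairwise Jaccard agreement among the chosen label sets, standing in for "consistency with peer wrappers". All names (`jaccard`, `fitness`, `evolve`, `candidates`) and the toy data are hypothetical.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Overlap between two label sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fitness(choice, candidates):
    """Sum of pairwise label agreement across the chosen extractions.

    choice[s] is the index of the candidate picked for source s.
    This peer-agreement score is an assumed stand-in for context awareness.
    """
    chosen = [candidates[s][i] for s, i in enumerate(choice)]
    return sum(jaccard(a, b) for a, b in combinations(chosen, 2))


def evolve(candidates, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Plain generational GA: elitism, binary tournaments, one-point crossover."""
    rng = random.Random(seed)
    n = len(candidates)

    def score(individual):
        return fitness(individual, candidates)

    # Each individual assigns one candidate index to every source.
    population = [[rng.randrange(len(c)) for c in candidates]
                  for _ in range(pop_size)]

    for _ in range(generations):
        best = max(population, key=score)
        next_gen = [best[:]]  # elitism: the best individual always survives
        while len(next_gen) < pop_size:
            # Binary tournament selection for each parent.
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            # One-point crossover (no cut point exists for a single source).
            cut = rng.randrange(1, n) if n > 1 else 0
            child = p1[:cut] + p2[cut:]
            # Mutation: re-pick a source's candidate at random.
            for s in range(n):
                if rng.random() < mutation_rate:
                    child[s] = rng.randrange(len(candidates[s]))
            next_gen.append(child)
        population = next_gen

    return max(population, key=score)


# Toy data: three sources, three candidate extractions each (hypothetical).
candidates = [
    [{"title", "price"}, {"title", "author", "price"}, {"name"}],
    [{"title", "author", "price"}, {"author"}, {"title"}],
    [{"price", "isbn"}, {"title", "author", "price"}, {"title", "author"}],
]
best = evolve(candidates)
print(best, fitness(best, candidates))
```

In this toy setup the fully consistent choice is `[1, 0, 1]`, where all three sources expose the same label set; because the search is stochastic, the sketch does not guarantee that a given run reaches it. A real system would replace the Jaccard score with a matching-based measure and could add domain rules as extra fitness terms.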
Keywords/Search Tags: Web data extraction, genetic algorithm, Deep Web, Context-Aware