Research On Adaptive Wrapper In Deep Web Data Extraction

Posted on:2014-02-16

Degree:Master

Type:Thesis

Country:China

Candidate:D L Liu

Full Text:PDF

GTID:2248330398960016

Subject:Computer software and theory

Abstract/Summary:


With the rapid development of Internet technology, the Deep Web has come to hold vast amounts of data and has become a huge, rapidly growing data source. This information is accessed through the query interfaces provided by back-end Web databases. The Deep Web contains a great deal of valuable information, but its data is heterogeneous and dynamic, which makes using it effectively a very challenging task. Deep Web data integration therefore remains an active research topic. It can effectively integrate data from the Web and support e-commerce, market intelligence analysis, and public opinion analysis. A key issue in Deep Web data integration is how to effectively extract the unstructured and semi-structured data in Deep Web pages; this extraction is the foundation of a Deep Web data integration system and provides services for data fusion and analysis.
On script-generated websites, many documents share a common HTML tree structure, which allows wrappers to extract the information of interest from Deep Web pages effectively. Because the tree structure evolves over time, wrappers break frequently and need to be re-learned. For adaptive wrappers in Deep Web data extraction, the following problems remain to be resolved:
(1) How to improve the robustness of the wrapper, i.e., how to ensure that the wrapper continues to extract the information of interest from future versions of a webpage as the page evolves over time.
(2) How to make extraction rules versatile, i.e., how to adaptively adjust the extraction rules learned for one data source so that they can be applied to other data sources.
This dissertation addresses Web data integration for Deep Web data and focuses on the above two issues, exploring the problem of constructing robust wrappers for Deep Web data extraction.
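As an illustration of why wrappers break when the HTML tree evolves, the following minimal sketch applies two path-based wrappers to two versions of a page. The page content, the path expressions, and the `extract` helper are all hypothetical and are not taken from the dissertation; the standard-library `xml.etree.ElementTree` module stands in for a real HTML parser, so the sample pages are kept well-formed.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same job-listing page.
PAGE_V1 = """<html><body>
<div><h1>Job list</h1></div>
<div><span class="title">Software Engineer</span></div>
</body></html>"""

# A later version: an advertisement block was inserted before the target.
PAGE_V2 = """<html><body>
<div><h1>Job list</h1></div>
<div>Advertisement</div>
<div><span class="title">Software Engineer</span></div>
</body></html>"""

# Hypothetical wrappers expressed as ElementTree path expressions.
BRITTLE = "./body/div[2]/span"        # relies on the position of a sibling
ROBUST = ".//span[@class='title']"    # relies on an attribute assumed to be stable


def extract(page, path):
    """Apply a wrapper (a path expression) and return the text of matched nodes."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(path)]


print(extract(PAGE_V1, BRITTLE))  # works on the page it was written for
print(extract(PAGE_V2, BRITTLE))  # breaks after the insertion
print(extract(PAGE_V2, ROBUST))   # still finds the target
```

The position-based wrapper stops matching once a sibling is inserted, while the attribute-based wrapper still matches. Choosing among such alternatives is the robustness problem stated in (1).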
The main contributions of this dissertation are as follows.
(1) To keep Web extraction robust when webpages change, an approach based on a minimum cost script edit model is proposed. The change frequencies of three edit operations (insertion, deletion, substitution) are computed for each HTML label by monitoring a set of web pages over time, and the corresponding edit costs are then calculated from them. The wrapper is created from the minimum cost edit script and a model learner. As the page changes, the extraction rules are adjusted so that the wrapper adapts better to changes in the website. Experimental results show that the proposed approach improves the extraction accuracy of target data, addresses the adaptive wrapper problem effectively, and improves the robustness and flexibility of Deep Web data extraction.
(2) A bottom-up method for generating minimal candidate wrappers is proposed. The method brings both the precision and the recall of the wrapper as close to 1 as possible in order to improve the extraction accuracy of target data. Experimental results show that wrappers produced by the proposed approach break less often.
(3) To make the wrapper versatile, a bootstrapping-based approach to Deep Web data extraction is proposed. First, an extraction model is obtained for the 51job site. Then, for other recruitment websites such as the Zhaopin site and the Yingjiesheng site, a large number of pages are randomly sampled and labeled with this extraction model to train the new wrapper. After features are extracted from the training data, the extraction model is used to identify attribute values on the new site. The resulting extraction rules replace manually labeled sample pages. In this way the work bootstraps onto new sites using training examples from other websites. The wrapper is highly versatile and achieves domain-adaptive extraction.
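The abstract does not give the cost formula used in contribution (1), so the following is only a rough sketch of how change frequencies might be turned into edit costs and then used to find a minimum cost edit. It makes several assumptions: the cost of an operation is the negative logarithm of its smoothed relative frequency (so frequently observed changes are cheap), pages are simplified to flat sequences of HTML labels rather than trees, and all function and variable names are hypothetical.

```python
import math
from collections import Counter

OPS = ("insert", "delete", "substitute")


def estimate_costs(observed_edits, tags, smoothing=1.0):
    """Turn observed (operation, tag) edits into a cost per (operation, tag).

    Assumption: cost = -log(smoothed relative frequency), so edits that were
    observed often while monitoring pages are considered cheap.
    """
    counts = Counter(observed_edits)
    total = len(observed_edits) + smoothing * len(OPS) * len(tags)
    return {
        (op, tag): -math.log((counts[(op, tag)] + smoothing) / total)
        for op in OPS
        for tag in tags
    }


def min_cost_edit(old, new, costs):
    """Minimum total cost of editing tag sequence `old` into `new`."""
    n, m = len(old), len(new)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + costs[("delete", old[i - 1])]
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + costs[("insert", new[j - 1])]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Matching labels cost nothing; otherwise pay the substitution cost.
            same = old[i - 1] == new[j - 1]
            sub = 0.0 if same else costs[("substitute", old[i - 1])]
            dist[i][j] = min(
                dist[i - 1][j] + costs[("delete", old[i - 1])],
                dist[i][j - 1] + costs[("insert", new[j - 1])],
                dist[i - 1][j - 1] + sub,
            )
    return dist[n][m]


# Hypothetical monitoring history: <div> insertions were seen most often.
tags = ["div", "span", "table"]
history = [("insert", "div")] * 8 + [("delete", "span")] * 2 + [("substitute", "table")]
costs = estimate_costs(history, tags)
print(min_cost_edit(["div", "span"], ["div", "div", "span"], costs))
```

Under these assumptions, a page change that consists of one extra `div` is explained by a single cheap insertion, whereas a change involving rarely edited labels would be assigned a higher cost.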
Experimental results show that the proposed approach improves the extraction accuracy of target data and effectively addresses the adaptive wrapper problem for massive Deep Web data.
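The precision and recall criterion mentioned in contribution (2) can be illustrated with a small sketch. It does not reproduce the bottom-up candidate generation itself, which the abstract does not describe in detail; it only scores given candidates against labeled target nodes. The candidate names, node ids, and the rule of ranking by the weaker of the two scores are assumptions made for illustration.

```python
def precision_recall(extracted, targets):
    """Return (precision, recall) of a wrapper's output against labeled targets."""
    extracted, targets = set(extracted), set(targets)
    hits = len(extracted & targets)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(targets) if targets else 0.0
    return precision, recall


def best_candidate(candidates, targets):
    """Pick the candidate whose weaker score (precision or recall) is highest."""
    return max(
        candidates,
        key=lambda name: min(precision_recall(candidates[name], targets)),
    )


# Hypothetical candidates: wrapper name -> ids of the nodes it extracts.
candidates = {
    "all_spans": {1, 2, 3, 4, 5, 6},  # finds every target but also noise
    "first_row": {1},                 # no noise but misses targets
    "title_spans": {1, 2, 3},         # exactly the targets
}
targets = {1, 2, 3}

for name, nodes in candidates.items():
    print(name, precision_recall(nodes, targets))
print(best_candidate(candidates, targets))
```

A candidate that is too general loses precision and one that is too specific loses recall, so requiring both to be close to 1 favors wrappers that match exactly the target nodes.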

Keywords/Search Tags:

Deep Web Data Integration, Deep Web Data Extraction, Wrapper, Minimum Cost Script Edit Model, Bootstrapping


Related items

1	Research On Deep Web Data Acquisition Based On Visual Information And DOM Tree
2	Research On Wrapper Adaptation In Web Data Integration
3	Research On Key Issues In Deep Web Data Integration
4	Study On Methods Of Ontology-Based Deep Web Data Integration
5	The Study Of Deep Web Data Integration System Design And Application
6	Research On Web Data Extraction For Web Data Integration
7	Research On Deep Web Oriented Information Extraction And Integration
8	Research On Web Information Extraction Based On Script Code And Local Data Matching
9	Research On Key Technologies Of Deep Web Information Integration
10	Research On Data Extraction And Schema Recognition On Deep Web