
Research On Adaptive Wrapper In Deep Web Data Extraction

Posted on: 2014-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: D L Liu
Full Text: PDF
GTID: 2248330398960016
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, the Deep Web holds vast amounts of data and, as the Web grows rapidly, has become a huge data source. This information can be accessed through the query interfaces provided by backend Web databases. The Deep Web contains a great deal of valuable information, but because its data is heterogeneous and dynamic, using that information effectively is a very challenging task. Deep Web data integration therefore remains an active research topic. It can effectively integrate Web data and support e-commerce, market intelligence analysis, and public opinion analysis. How to effectively extract the unstructured and semi-structured data in Deep Web pages is a key issue: it is the foundation of a Deep Web data integration system and provides services for data fusion and analysis.
Many documents on script-generated websites share a common HTML tree structure, which allows wrappers to extract the information of interest from Deep Web pages effectively. Because the tree structure evolves over time, wrappers break frequently and need to be re-learned. For adaptive wrappers in Deep Web data extraction, the following problems remain to be resolved:
(1) How to improve the robustness of the wrapper, i.e., how to make the wrapper continue to extract the information of interest from future versions of a webpage as the page evolves over time.
(2) How to give extraction rules versatility, i.e., how to adaptively adjust the extraction rules learned for one data source so that they can be applied to other data sources.
This dissertation targets Web data integration for Deep Web data and focuses on the two issues above, exploring the problem of constructing robust wrappers for Deep Web data extraction.
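To make the robustness problem concrete, the following is a minimal sketch, not taken from the dissertation. It assumes a wrapper can be represented as a path expression over the page tree. The two page versions, the `extract` helper, and both path expressions are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same script-generated job page.
PAGE_V1 = """
<html><body>
  <div class="header">Site header</div>
  <div class="job"><span class="title">Software Engineer</span></div>
</body></html>
"""

# In the later version an advertisement block is inserted before the job record.
PAGE_V2 = """
<html><body>
  <div class="header">Site header</div>
  <div class="ad">Sponsored</div>
  <div class="job"><span class="title">Software Engineer</span></div>
</body></html>
"""

# Wrapper that relies on the position of the job block among its siblings.
POSITIONAL = "./body/div[2]/span"
# Wrapper that relies on attributes, which survive the inserted block.
BY_ATTRIBUTE = ".//div[@class='job']/span[@class='title']"


def extract(page, path):
    """Apply a path-expression wrapper; return the matched text, or None if it breaks."""
    root = ET.fromstring(page.strip())
    node = root.find(path)
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, extract(page, POSITIONAL), extract(page, BY_ATTRIBUTE))
```

On the first version both wrappers return the job title. After the insertion, the positional wrapper returns nothing while the attribute-based one still works, which is the kind of breakage a robust wrapper is meant to avoid.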
The main innovations and contributions of this dissertation are as follows.
(1) To keep Web extraction robust when webpages change, an approach based on a minimum cost script edit model is proposed. The change frequencies of three edit operations (insertion, deletion, substitution) are computed for each HTML label by monitoring a set of webpages over time, and the corresponding edit costs are then calculated. The wrapper is created from the minimum cost edit script and a model learner. The extraction rules are adjusted according to the changes in the page, so that the wrapper adapts better to changes in the website. Experimental results show that the proposed approach improves the extraction accuracy of target data, effectively addresses the adaptive wrapper problem, and improves the robustness and flexibility of Deep Web data extraction.
(2) A bottom-up method for generating minimal candidate wrappers is proposed. This method brings both the precision and the recall of the wrapper as close to 1 as possible, improving the extraction accuracy of target data. Experimental results show that the proposed approach leads to less wrapper breakage.
(3) For the versatility of the wrapper, a bootstrapping-based approach to Deep Web data extraction is proposed. First, an extraction model is obtained for the 51job site. Then, for other recruitment websites such as the Zhaopin and Yingjiesheng sites, a large number of pages are randomly sampled and used with this extraction model to train the new wrapper. After features are extracted from the training data, the extraction model is used to identify attribute values on the new site. The resulting extraction rules take the place of labeled sample pages. In this way the work bootstraps onto new sites using training examples from other websites. The wrapper is highly versatile and achieves domain-adaptive extraction.
Experimental results show that the proposed approach improves the extraction accuracy of target data and effectively addresses the adaptive wrapper problem for massive Deep Web data.
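The edit-cost idea in contribution (1) can be illustrated with a small sketch. It is a simplification under stated assumptions, not the dissertation's implementation. It treats a page as a flat sequence of tags along the path to the target node rather than a full tree. It derives each cost as the negative log of a smoothed change frequency, a formula the abstract does not specify. The function names, the `default` cost, and the sample counts are hypothetical.

```python
import math


def edit_costs(change_counts, observations):
    """Convert observed change counts per tag into edit costs.

    Assumption: cost = -log(smoothed frequency), so frequent changes are cheap
    and rare changes are expensive.
    """
    return {
        tag: {op: -math.log((n + 1) / (observations + 2)) for op, n in counts.items()}
        for tag, counts in change_counts.items()
    }


def min_cost_edit(old, new, costs, default=5.0):
    """Cost of the cheapest edit script turning tag sequence `old` into `new`."""

    def cost(tag, op):
        # Tags never observed to change fall back to a fixed (hypothetical) cost.
        return costs.get(tag, {}).get(op, default)

    m, n = len(old), len(new)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost(old[i - 1], "delete")
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost(new[j - 1], "insert")
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = old[i - 1] == new[j - 1]
            d[i][j] = min(
                d[i - 1][j] + cost(old[i - 1], "delete"),
                d[i][j - 1] + cost(new[j - 1], "insert"),
                d[i - 1][j - 1] + (0.0 if same else cost(old[i - 1], "substitute")),
            )
    return d[m][n]


# Hypothetical counts gathered from 100 observed page versions.
COUNTS = {
    "div": {"insert": 40, "delete": 30, "substitute": 5},
    "span": {"insert": 2, "delete": 1, "substitute": 1},
}
COSTS = edit_costs(COUNTS, observations=100)

if __name__ == "__main__":
    old_path = ["html", "body", "div", "span"]
    new_path = ["html", "body", "div", "div", "span"]
    print(round(min_cost_edit(old_path, new_path, COSTS), 3))
```

With these sample counts, inserting a `div` is cheap because it was observed often, so the cheapest script explaining the new path is a single `div` insertion. A wrapper learner could use such costs to prefer extraction rules that survive the likely edits.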
Keywords/Search Tags:Deep Web Data Integration, Deep Web Data Extraction, Wrapper, Minimum Cost Script Edit Model, Bootstrapping