
Research On Key Issues In Deep Web Data Integration

Posted on: 2011-04-30    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Q Dong    Full Text: PDF
GTID: 1118360305450914    Subject: Computer software and theory
Abstract/Summary:
With the rapid development of network technology, the Web has become a huge information source whose massive data hold important value. In many application domains, such as market intelligence analysis, there is an urgent need to analyze and mine these data for knowledge that can aid decision making. However, Web data are heterogeneous, autonomous and distributed, which makes such analysis and mining difficult; Web data integration has therefore become a pressing problem. According to the depth at which its data are stored, the Web can be divided into two parts: the Surface Web and the Deep Web. The volume and quality of the data in the Deep Web already far surpass those of the Surface Web, so integrating Deep Web data to support analysis and mining has strong application value and broad prospects.

Recent research has focused on query-oriented Deep Web data integration, which obtains a limited amount of data and suits on-the-fly user queries. That integration method, however, does not fit applications whose goal is analysis and mining. This thesis studies analysis-oriented Deep Web data integration, whose goal is to obtain as many Deep Web pages as possible and to apply extraction and deduplication techniques to them, yielding the structured, high-quality data on which analysis and mining are based. Analysis-oriented Deep Web data integration must resolve the following issues:

(1) Analyses require plenty of data, which come from Deep Web pages dynamically generated by multiple Web databases in the same domain, so the maximum number of pages must be acquired automatically.
(2) Analyses require well-formed, semantically rich data, which reside in complex, semi-structured Deep Web pages, so the structured data must be extracted accurately and their semantics understood.
(3) Analyses require consistent, high-quality data, which exist in multiple Web databases in the same domain with a high repetition rate, so duplicated records among these Web databases must be detected.

This dissertation targets analysis-oriented Deep Web data integration and focuses on these issues. The main research works and contributions are as follows.

1. A query interface matching approach based on extended evidence theory is proposed to effectively solve the problem of semantically understanding query interfaces when crawling different Web databases.

There are a large number of Web databases in the same domain, and the heterogeneities among their query interfaces make it very difficult to recognize, in a unified way, the interface attributes through which query terms are submitted. To solve this issue, a query interface matching approach based on extended evidence theory is proposed: it constructs matches between the query interface of the Web database to be crawled and the domain query interface in order to understand the interface's semantic information. The approach fully exploits multiple features of query interfaces to construct different matchers, then extends traditional evidence theory with dynamically predicted matcher credibilities to combine the matchers' results. Finally, it makes one-to-one matching decisions under a top-k globally optimal policy and applies heuristic rules over the tree structure to make one-to-many matching decisions. Experimental results show that the proposed approach improves matching accuracy and overcomes the poor adaptability of traditional approaches.
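The evidence-combination step can be illustrated with a small sketch. Below, each matcher's similarity score is discounted by its predicted credibility and the resulting mass functions are fused with Dempster's rule over a two-hypothesis frame ("match" vs. "no match"); the matcher names, scores and credibilities are illustrative assumptions, not the dissertation's actual matchers or numbers.

```python
# Combining discounted matcher evidence with Dempster's rule.
# Frame of discernment: {"match", "nomatch"}; "theta" carries the mass
# left on the whole frame (ignorance) after credibility discounting.

def mass_from_matcher(score, credibility):
    """Turn a similarity score in [0, 1] into a mass function,
    discounted by the matcher's dynamically predicted credibility."""
    return {
        "match": credibility * score,
        "nomatch": credibility * (1.0 - score),
        "theta": 1.0 - credibility,
    }

def dempster_combine(m1, m2):
    """Dempster's rule of combination for the frame {match, nomatch}."""
    # Conflicting mass: one source supports "match", the other "nomatch".
    k = m1["match"] * m2["nomatch"] + m1["nomatch"] * m2["match"]
    if k >= 1.0:
        raise ValueError("totally conflicting evidence")
    norm = 1.0 - k
    return {
        "match": (m1["match"] * m2["match"] + m1["match"] * m2["theta"]
                  + m1["theta"] * m2["match"]) / norm,
        "nomatch": (m1["nomatch"] * m2["nomatch"] + m1["nomatch"] * m2["theta"]
                    + m1["theta"] * m2["nomatch"]) / norm,
        "theta": m1["theta"] * m2["theta"] / norm,
    }

# Hypothetical evidence from three matchers for one attribute pair.
evidence = [
    mass_from_matcher(score=0.80, credibility=0.9),  # e.g. label matcher
    mass_from_matcher(score=0.65, credibility=0.7),  # e.g. type matcher
    mass_from_matcher(score=0.90, credibility=0.8),  # e.g. value matcher
]
combined = evidence[0]
for m in evidence[1:]:
    combined = dempster_combine(combined, m)
print(combined)  # mass supporting "match" after fusing all three matchers
```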
2. A Web database crawling approach based on a query harvest rate model is proposed to effectively solve the problem of large-scale acquisition of Deep Web pages.

Analysis and mining applications need large amounts of Deep Web data, which come from Deep Web pages generated dynamically by multiple Web databases in the same domain. Because of the special access method of Web databases, however, this information cannot be reached by the crawlers of traditional search engines. To solve this issue, a Web database crawling approach based on a query harvest rate model is proposed. The approach first samples the Web database and uses the sample database to select multiple kinds of features and construct training instances automatically, avoiding manual labeling. It then learns a query harvest rate model from those instances. Finally, it uses the model to select the most promising query term to submit in each crawling round, so as to crawl as much of the Web database as possible. Experimental results show that the proposed approach achieves high coverage of a Web database and overcomes the simplistic, purely empirical nature of traditional heuristic rules; the learned model can also be used effectively to crawl other Web databases in the same domain.
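The greedy crawling loop might look like the sketch below, where `model` is any trained regressor with a scikit-learn-style `predict` that estimates a term's harvest rate, and `term_features`, `submit_query` and `extract_records` are hypothetical helpers; the dissertation's actual features and model are not reproduced here.

```python
# Harvest-rate-driven crawling: in each round, submit the candidate
# query term whose predicted harvest rate (share of new records it
# should return) is highest. Records are assumed to be hashable tuples.

def crawl_web_database(seed_terms, model, term_features,
                       submit_query, extract_records, budget=200):
    seen = set()              # records fetched so far
    unused = set(seed_terms)  # query terms not yet submitted
    for _ in range(budget):
        if not unused:
            break
        # Greedily pick the term the model considers most promising.
        best = max(unused,
                   key=lambda t: model.predict([term_features(t)])[0])
        unused.discard(best)
        for record in extract_records(submit_query(best)):
            seen.add(record)
        # Terms mined from the newly fetched records could be added to
        # "unused" here, letting the crawl grow beyond the seed terms.
    return seen
```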
3. A data extraction approach for the Deep Web based on hierarchical clustering is proposed to effectively solve the problem of extracting structured data from massive numbers of Deep Web pages.

Deep Web pages are so complex in structure that the structured data they contain are difficult to process automatically. To solve this issue, a data extraction approach based on hierarchical clustering is proposed. The approach uses information from the query-result list page to recognize the content blocks of a Deep Web page, which delimits the extraction area. It then combines structural and content features from multiple Deep Web pages and clusters the content feature vectors of the corresponding content blocks of those pages to extract the Web data records effectively. Experimental results show that the proposed approach significantly improves extraction accuracy and overcomes the limitation of traditional approaches, which use only the structural information of a single page.

4. A semantic annotation approach for Deep Web data based on constrained conditional random fields is proposed to effectively solve the problems of labeling attributes that lack semantic labels and of schema heterogeneity among data records from different Web sites.

The extracted Web data records need to be annotated, but the semantic labels present in Deep Web pages cannot by themselves annotate unlabeled data elements, and different sites often use different semantic labels, resulting in schema heterogeneity among their data records. To solve this issue, a semantic annotation approach based on constrained conditional random fields is proposed. The approach incorporates confidence constraints and logical constraints so as to exploit both the existing Web database and the logical relationships among Web data elements, and it adds an inference procedure based on integer linear programming, extending traditional conditional random fields to support both kinds of constraints naturally and efficiently. The global attribute labels of the domain Web database schema are used to annotate every data element in the Web data records. Experimental results show that the proposed approach significantly improves annotation accuracy and overcomes the limitation of traditional conditional random fields, which cannot use the existing Web database and the logical relationships among Web data elements simultaneously.

5. A duplicate record detection approach based on unsupervised learning is proposed to effectively solve the problem of massive duplicate record detection in the Deep Web.

Motivated by the large scale and high redundancy of the Deep Web, the approach first uses a cluster ensemble to select initial training instances, avoiding manual labeling. It then trains an SVM classifier iteratively to construct the classification model, improving the model's accuracy. Finally, it combines the results of multiple classification models by voting to build a domain-level duplicate record detection model, which effectively solves the problem of massive duplicate record detection. Experimental results show that the proposed approach achieves high detection accuracy and that the domain-level model performs well, overcoming the inability of traditional approaches to carry out massive duplicate record detection.
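The bootstrap-and-iterate core of the detection step can be sketched as follows, with scikit-learn standing in for whatever toolkit was actually used: a single k-means run replaces the cluster ensemble, a single SVM replaces the voted combination of models, and the thresholds are illustrative, not the dissertation's values.

```python
# Unsupervised duplicate-record detection: seed labels come from
# clustering similarity vectors of record pairs, then an SVM is grown
# iteratively on its own most confident predictions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def detect_duplicates(pair_vectors, rounds=5, margin=1.0):
    """pair_vectors: (n_pairs, n_features) field-similarity scores in [0, 1]."""
    X = np.asarray(pair_vectors)

    # 1. Unsupervised seeding: of two clusters, the one with the higher
    #    mean similarity is taken to hold the duplicate pairs (label 1).
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    dup_cluster = int(km.cluster_centers_.mean(axis=1).argmax())
    labels = (km.labels_ == dup_cluster).astype(int)

    # Seed the training set with each cluster's most central members,
    # i.e. the pairs whose pseudo-labels we trust most.
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    train_idx = np.concatenate([
        np.where(km.labels_ == c)[0][
            np.argsort(dist[km.labels_ == c])[: max(5, len(X) // 20)]]
        for c in (0, 1)
    ])

    # 2. Iterative SVM training: each round, pairs classified far from
    #    the decision boundary are adopted as new training instances.
    for _ in range(rounds):
        clf = SVC(kernel="rbf").fit(X[train_idx], labels[train_idx])
        scores = clf.decision_function(X)
        confident = np.where(np.abs(scores) > margin)[0]
        labels[confident] = (scores[confident] > 0).astype(int)
        train_idx = np.union1d(train_idx, confident)

    return clf.predict(X)  # 1 = predicted duplicate pair
```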
Keywords/Search Tags: Deep Web Data Integration, Query Interface Matching, Web Data Extraction, Web Data Semantic Annotation, Duplicate Record Detection