Research On Techniques For Deep Web Data Fusion Based On Source Dependence

Posted on:2014-01-21

Degree:Master

Type:Thesis

Country:China

Candidate:S S Lu

GTID:2248330398465315

Subject:Computer application technology

Abstract/Summary:

In Deep Web data mining, data conflicts often arise among different data sources. How to resolve these conflicts and obtain the correct values (known as data fusion) is a key issue in data integration. In the ideal case where all data sources are independent, if more sources provide the correct value than provide false ones, a voting mechanism can simply take the value provided by the majority of sources as the truth. However, web technologies have made copying easy and have also allowed copying relationships to become complex. To present high-quality data to users, we want to ignore copied information when fusing data in a top-k query interface.

This paper applies statistical methods to analyze dependencies among different data sources. We then introduce these dependencies into an online data fusion and data integration framework, so that users can get more accurate results with maximum coverage at minimum cost. The work consists of the following three points:

(1) Propose a method to detect dependencies between data sources. The method uses Bayesian analysis to determine dependencies between data sources, and we design an iterative algorithm that detects dependencies and fuses data. We further extend the algorithm by considering source accuracy and similarities among attributes, which improves the data fusion results.

(2) Describe a technique to discover complex copying relationships among a set of data sources. First, we revise the local detection algorithm above and propose a framework. The framework can incorporate different types of copying evidence and considers copying correlation across different data items, so as to meet the accuracy requirements of global detection of copying direction. Second, we put forward a global detection model that can eliminate complex copying relationships such as co-copiers, multi-source copiers and transitive copiers, returning only pairs of data sources with direct copying.

(3) Introduce dependencies to build an online data fusion system. Starting from the first data source accessed, the system calculates vote counts incrementally, returns answers with a confidence range, and terminates when certain conditions are met. We also design a data source ordering algorithm to achieve fast convergence and return high-quality answers earlier.

The methods presented in this paper are evaluated on real data sets, and the results show that our techniques are feasible and effective.
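The contrast the abstract draws between plain majority voting and voting that discounts copied values can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function names, the `dep_prob` input (a pairwise dependence probability that a Bayesian detection step might produce), and the `copy_rate` parameter are all assumptions made for illustration. The discount rule used here, multiplying a source's vote by `1 - copy_rate * p` for each already-counted source that gives the same value, is one simple way to down-weight likely copies.

```python
from collections import defaultdict


def majority_vote(claims):
    """Naive voting: every source contributes one full vote.

    claims: dict mapping source name -> claimed value.
    """
    counts = defaultdict(float)
    for value in claims.values():
        counts[value] += 1.0
    return max(counts, key=counts.get)


def dependence_aware_vote(claims, dep_prob, copy_rate=0.8):
    """Voting in which likely copies contribute less than a full vote.

    dep_prob (hypothetical input): dict mapping frozenset({a, b}) to the
    estimated probability that sources a and b are dependent.
    copy_rate (assumed parameter): probability that a dependent source
    copied this particular value rather than providing it independently.
    """
    counts = defaultdict(float)
    counted = []
    for source in sorted(claims):  # fixed order keeps the result deterministic
        value = claims[source]
        weight = 1.0
        for prev in counted:
            if claims[prev] == value:
                p = dep_prob.get(frozenset((source, prev)), 0.0)
                weight *= 1.0 - copy_rate * p
        counts[value] += weight
        counted.append(source)
    winner = max(counts, key=counts.get)
    return winner, dict(counts)


# Illustrative data only: S4 and S5 are assumed to be likely copies of S3.
claims = {"S1": "A", "S2": "A", "S3": "B", "S4": "B", "S5": "B"}
dep_prob = {
    frozenset(("S3", "S4")): 0.9,
    frozenset(("S3", "S5")): 0.9,
    frozenset(("S4", "S5")): 0.9,
}

print(majority_vote(claims))                       # "B": three raw votes beat two
print(dependence_aware_vote(claims, dep_prob)[0])  # "A": copied votes are discounted
```

In this toy data the three sources claiming "B" outvote the two claiming "A" under naive voting, but once the probable copies are discounted, "B" accumulates only about 1.36 votes against 2 for "A". An iterative scheme like the one the abstract describes would additionally re-estimate the dependence probabilities from the fused values, which this sketch leaves out.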

Keywords/Search Tags:

Deep Web, Data Integration, Data Fusion, Data Conflict, Source Dependence, Copy Detection

Related items

1	Research On Data Fusion For Web Data Integration
2	Research On Data Fusion For Deep Web Data Integration
3	Research On Deep Web Sources Classification
4	Data Integration Research Of Power-generating Groups
5	Research On Key Issues In Deep Web Data Integration
6	Study Of The Detecting And Resolving Data Conflict In Data Integration
7	Study On Application Of XML In Data Integration
8	Design And Implementation Of Data Integration Platform Oriented To The Shared Database Center
9	The Design And Implementation Of Data Synchronization Software In ERP System Of An Enterprise
10	Algorithms Of Copy Detection And Truth Discovery For Multi-relational Data