Research On Techniques For Deep Web Data Fusion Based On Source Dependence

Posted on: 2014-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: S S Lu
Full Text: PDF
GTID: 2248330398465315
Subject: Computer application technology
Abstract/Summary:
In Deep Web data mining, different data sources often provide conflicting values. Resolving these conflicts to obtain the correct values, known as data fusion, is a key issue in data integration. In the ideal case where all data sources are independent and more sources provide the correct value than provide false ones, a voting mechanism suffices: we take the value provided by the majority of the sources as the truth. However, web technologies have made copying easy and have allowed copying relationships to become complex. To present high-quality data to users, we want to ignore copied information when fusing data in a top-k query interface.

This paper applies statistical methods to analyze dependencies among data sources. We then introduce these dependencies into an online data fusion and data integration framework, so that users get more accurate results with maximum coverage at minimum cost. The work makes the following three contributions:

(1) We propose a method to detect dependencies between data sources. The method uses Bayesian analysis to determine dependencies between data sources, and we design an iterative algorithm that detects dependencies and fuses data. We further extend the algorithm to consider source accuracy and similarities among attributes, which improves the data fusion results.

(2) We describe a technique to discover complex copying relationships among a set of data sources. First, we revise the local detection algorithm above and propose a framework that can incorporate different types of copying evidence and considers copying correlation across different data items, in order to meet the accuracy requirements of global detection of copying direction. Second, we put forward a global detection model that eliminates complex copying relationships such as co-copiers, multi-source copiers, and transitive copiers, returning only pairs of data sources with direct copying.

(3) We introduce dependencies to build an online data fusion system. Starting from the first data source accessed, the system computes vote counts incrementally, returns answers with a confidence range, and terminates when certain conditions are met. We also design a data source ordering algorithm to achieve fast convergence and return high-quality answers earlier.

The methods presented in this paper are evaluated on real data sets, and the results show that the approach is feasible and effective.
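
The contrast the abstract draws between plain majority voting and voting that accounts for copying can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of the thesis: the function names, the `copy_prob` table, and the `copy_rate` value of 0.8 are hypothetical, and the copy probabilities are taken as given rather than estimated by Bayesian analysis.

```python
from collections import defaultdict


def naive_vote(claims):
    """Return the value provided by the most sources (one vote per source)."""
    counts = defaultdict(float)
    for value in claims.values():
        counts[value] += 1.0
    return max(counts, key=counts.get)


def dependence_aware_vote(claims, copy_prob, copy_rate=0.8):
    """Weight each vote by the chance the source supplied its value independently.

    claims:    dict mapping source name -> value it provides for one data item.
    copy_prob: dict mapping (a, b) -> assumed probability that a and b are in a
               copying relationship (hypothetical input, not estimated here).
    copy_rate: assumed probability that a copier copies any particular value.
    """
    counts = defaultdict(float)
    counted = defaultdict(list)  # value -> sources already counted for it
    for source in sorted(claims):  # fixed order keeps the sketch deterministic
        value = claims[source]
        independence = 1.0
        for earlier in counted[value]:
            p = max(copy_prob.get((source, earlier), 0.0),
                    copy_prob.get((earlier, source), 0.0))
            independence *= 1.0 - copy_rate * p
        counts[value] += independence
        counted[value].append(source)
    return max(counts, key=counts.get), dict(counts)


# Two independent sources say "A"; three sources say "B",
# but S4 and S5 are assumed to be likely copiers of S3.
claims = {"S1": "A", "S2": "A", "S3": "B", "S4": "B", "S5": "B"}
copy_prob = {("S4", "S3"): 0.9, ("S5", "S3"): 0.9}

naive_winner = naive_vote(claims)
aware_winner, aware_counts = dependence_aware_vote(claims, copy_prob)
print(naive_winner, aware_winner, aware_counts)
```

In this toy input the naive vote picks "B" on a 3-to-2 count, while the discounted vote favours "A", because the two likely copiers contribute only a fraction of a vote each.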
Keywords/Search Tags: Deep Web, Data Integration, Data Fusion, Data Conflict, Source Dependence, Copy Detection