Research on Truth Discovery Based on the Bayes Model in Web Data Integration

Posted on: 2016-12-22
Degree: Master
Type: Thesis
Country: China
Candidate: D Yu
Full Text: PDF
GTID: 2428330542957396
Subject: Computer software and theory
Abstract/Summary:
In many web data integration applications, several sources describe the same entity with different values, which leads to conflicts. Resolving these conflicts and finding the truth improves the quality of integration and supports tasks such as building a high-quality knowledge base. In the single-truth conflict scenario, the source quality metrics used by existing methods are inadequate, and these methods cannot handle complex data copying among sources. For multiple-truth conflicts, some methods are supervised and incur a high manual labeling cost. Other methods apply probabilistic graphical models to find truths, but they perform poorly when the distributions assumed for the latent variables do not fit the actual data set. This thesis therefore studies single-truth and multiple-truth data conflicts in the following three parts.

The first part is truth discovery based on a single source quality metric for the single-truth scenario. We measure source quality with one metric, accuracy. When computing the accuracy of a source, we consider both the facts the source provides and the conflicting facts on the same entities that it does not provide. To address erroneous data being mistaken for truth because of data copying, for each fact we treat the sources that provide it as a group and use the joint accuracy of that source group to capture copying that may exist among them. We then propose Bayes-based algorithms, built on this single source quality metric, to solve the single-truth data conflict problem.

The second part is truth discovery based on multiple source quality metrics for the single-truth scenario. Here the thesis measures source quality with two metrics, recall and false positive rate, which together distinguish false negatives from false positives. On this basis we propose a more accurate truth discovery algorithm under the assumption that sources are independent. We further use joint recall and joint false positive rate to handle data copying among sources, which yields a truth discovery algorithm that can deal with complex copying relationships.

The third part is truth discovery based on multiple source quality metrics for the multiple-truth scenario. We analyze the differences between the multiple-truth and single-truth scenarios. We then use recall and false positive rate to measure source quality, computing both metrics only over the entities that the source covers. Subsequently, we introduce an unsupervised algorithm that solves the truth discovery problem under the source independence assumption.

Finally, we conduct extensive experiments on real-world and synthetic data sets. The results demonstrate the effectiveness of using multiple source quality metrics and show that our methods can effectively resolve single-truth and multiple-truth data conflicts, respectively.
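The abstract does not give the algorithms themselves, so the following is only a minimal Python sketch of the general idea behind the third part: an unsupervised, iterative Bayesian estimate of fact probabilities from per-source recall and false positive rate, assuming independent sources. The function name `discover_truths`, the input layout, the prior of 0.5, the starting values 0.8 and 0.2, and the clamping constant are all illustrative assumptions, not details taken from the thesis. Joint metrics for data copying are not modeled here.

```python
import math


def discover_truths(claims, prior=0.5, iterations=20, eps=1e-3):
    """Illustrative sketch (not the thesis's algorithm).

    claims: dict mapping source name -> set of (entity, value) facts.
    Returns (prob, recall, fpr): the estimated probability that each
    fact is true, and the estimated quality metrics of each source.
    """
    facts = set().union(*claims.values())
    # A source "covers" an entity if it provides any value for it.
    covers = {s: {entity for entity, _ in fs} for s, fs in claims.items()}
    # Assumed starting point: sources are fairly good but imperfect.
    recall = {s: 0.8 for s in claims}
    fpr = {s: 0.2 for s in claims}
    prob = {}

    def clamp(x):
        # Keep metrics away from 0 and 1 so the logarithms stay finite.
        return min(max(x, eps), 1.0 - eps)

    for _ in range(iterations):
        # Step 1: Bayes rule per fact, treating sources as independent.
        for fact in facts:
            entity = fact[0]
            log_true = math.log(prior)
            log_false = math.log(1.0 - prior)
            for s in claims:
                if entity not in covers[s]:
                    continue  # only sources covering the entity vote
                if fact in claims[s]:
                    log_true += math.log(recall[s])
                    log_false += math.log(fpr[s])
                else:
                    log_true += math.log(1.0 - recall[s])
                    log_false += math.log(1.0 - fpr[s])
            diff = min(log_false - log_true, 700.0)  # avoid exp overflow
            prob[fact] = 1.0 / (1.0 + math.exp(diff))

        # Step 2: re-estimate each source's metrics within its own scope.
        for s in claims:
            scope = [f for f in facts if f[0] in covers[s]]
            true_mass = sum(prob[f] for f in scope)
            false_mass = sum(1.0 - prob[f] for f in scope)
            provided_true = sum(prob[f] for f in claims[s])
            provided_false = sum(1.0 - prob[f] for f in claims[s])
            if true_mass > 0:
                recall[s] = clamp(provided_true / true_mass)
            if false_mass > 0:
                fpr[s] = clamp(provided_false / false_mass)

    return prob, recall, fpr


# Hypothetical toy data: book authors, where a book may have several.
claims = {
    "A": {("book1", "Ann"), ("book1", "Bob"), ("book2", "Dan")},
    "B": {("book1", "Ann"), ("book1", "Bob"), ("book2", "Dan")},
    "C": {("book1", "Ann"), ("book1", "Carl"), ("book2", "Eve")},
}
prob, recall, fpr = discover_truths(claims)
truths = {fact for fact, p in prob.items() if p > 0.5}
```

Because each candidate value is scored on its own rather than competing for a single slot per entity, an entity can end up with several accepted values, which is what distinguishes the multiple-truth setting from the single-truth one.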
Keywords/Search Tags: truth discovery, data conflicts, data copying, single truth, multiple truths, data integration