Research On Key Technologies Of Conflict Resolution On Massive Dirty Data

Posted on: 2014-08-01
Degree: Master
Type: Thesis
Country: China
Candidate: L Jia
Full Text: PDF
GTID: 2268330422450627
Subject: Computer Science and Technology
Abstract/Summary:
Big data arises in many fields, and it brings a large amount of dirty data with it. Data quality therefore matters more than ever in the big data era, yet the corresponding definitions and methods are far from mature and do not meet the demands of industry. This thesis studies current research hot spots in data quality and aims to improve how they behave in practice, focusing on conflict resolution, a subarea of data cleaning. Our methods balance efficiency with generalization ability. Our main contributions are threefold: an incremental truth discovery algorithm, a concurrent truth discovery algorithm, and a conflict resolution framework together with its implementation.

When data from different sources is integrated, some descriptions of the same entity may conflict, and these conflicts must be resolved in the resulting low-quality data. We define conflict resolution as improving the quality of the data itself. The core strategy of conflict resolution is truth discovery, which can also serve as a standalone application. We give a series of definitions covering conflict types and their resolution, propose a framework, and implement it. Experiments show good results.

In practice, input data may arrive incrementally during data integration, and static algorithms cannot adapt to this situation. To make truth discovery more practical, we present an incremental strategy for multi-source integration that uses a boosting-like ensemble classifier. The algorithm adapts to different update situations by taking concept drift into account during learning. We implement these algorithms with MapReduce to parallelize the processing of large data. Experiments show that the algorithm performs well on both synthetic and real data sets.
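To make the idea of truth discovery concrete, here is a minimal sketch in Python. It is not the thesis's incremental or MapReduce algorithm. It assumes a generic, static scheme: each source starts with the same trust score, the value with the highest trust-weighted vote is taken as the truth for each entity, and each source's trust is then re-estimated as the fraction of its claims that agree with the current truths. The function name discover_truths, the (source, entity, value) claim format, and the initial trust of 0.5 are illustrative assumptions.

```python
from collections import defaultdict


def discover_truths(claims, iterations=10):
    """Iterative trust-weighted voting (illustrative sketch only).

    claims: list of (source, entity, value) tuples (assumed input format).
    Returns (truths, trust): the chosen value per entity and a score per source.
    """
    sources = {source for source, _, _ in claims}
    trust = {source: 0.5 for source in sources}  # uniform starting trust (assumption)
    truths = {}

    for _ in range(iterations):
        # Step 1: for each entity, sum the trust of the sources behind each value.
        votes = defaultdict(lambda: defaultdict(float))
        for source, entity, value in claims:
            votes[entity][value] += trust[source]
        truths = {
            entity: max(values, key=values.get)
            for entity, values in votes.items()
        }

        # Step 2: a source's trust is the share of its claims matching the truths.
        agreeing = defaultdict(int)
        total = defaultdict(int)
        for source, entity, value in claims:
            total[source] += 1
            if truths[entity] == value:
                agreeing[source] += 1
        trust = {source: agreeing[source] / total[source] for source in sources}

    return truths, trust


if __name__ == "__main__":
    # Three hypothetical sources describing the same record.
    example_claims = [
        ("A", "city", "Beijing"), ("B", "city", "Beijing"), ("C", "city", "Shanghai"),
        ("A", "year", "2014"), ("B", "year", "2013"), ("C", "year", "2013"),
        ("A", "degree", "Master"), ("B", "degree", "Master"), ("C", "degree", "PhD"),
    ]
    truths, trust = discover_truths(example_claims)
    print(truths)
    print(trust)
```

A static procedure like this has to be rerun from scratch whenever new claims arrive, which is the limitation the incremental strategy described above is meant to address.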
Keywords/Search Tags: conflict resolution, truth discovery, incremental, Hadoop, big data, data quality