
Research On A Model Of Data Completeness And Evaluating Algorithms

Posted on: 2014-11-07  Degree: Master  Type: Thesis
Country: China  Candidate: Y N Liu  Full Text: PDF
GTID: 2268330422950589  Subject: Computer Science and Technology
Abstract/Summary:
With the development of modern information technology, the volume of information has increased sharply. This growth also introduces poor-quality data, which badly affects the use of information in a digital society, and misunderstood information can lead to great losses. Data quality has therefore become a severe problem and has prompted heated discussion of related issues.

Handling incomplete data is a common problem, and how to evaluate data completeness is a basic research problem. Reasoning about data completeness cannot reflect the total completeness of a data set, and it also needs to refer to extra completeness information. Existing methods of evaluating data completeness do not take into account false null values, that is, null values that are determined by other values in the data set, which leads to underestimated data completeness.

This dissertation investigates the evaluation of data completeness and gives a data completeness model, suitable for different applications, consisting of three kinds of completeness: attribute value completeness, tuple completeness and relation completeness. The latter two can be evaluated from attribute value completeness under a defined computing function. With functional dependencies, attribute value completeness can be determined accurately, which in turn yields an accurate relation completeness. Based on this model, the problem of evaluating data completeness is investigated and formally defined. Different lower bounds for this problem under different assumptions are given, together with exact algorithms that reach these bounds once the computing functions are defined. Approximate algorithms based on uniform sampling are proposed to evaluate the data completeness of massive data. Theoretical analysis shows that the approximate algorithms can reach any given precision. Reservoirs are introduced into the approximate algorithms to improve performance on unknown data sets, with precision guaranteed. Experiments on real and synthetic data show the effectiveness of the model and the efficiency of the proposed exact and approximate algorithms.
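The abstract does not give the model's formal definitions, so the following is only a minimal sketch of how the three levels of completeness and the reservoir-based estimate might fit together. Several things here are assumptions made for illustration, not the thesis's method: the computing function is taken to be a plain average, a null counts as "false" when a functional dependency and another tuple determine its value, and all names (`build_fd_index`, `value_completeness`, `reservoir_sample`, the `zip`/`city`/`phone` attributes) are hypothetical.

```python
import random
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[str]]
FD = Tuple[Tuple[str, ...], str]  # (determinant attributes, dependent attribute)
FDIndex = Dict[FD, Dict[Tuple[Optional[str], ...], str]]


def build_fd_index(rows: Sequence[Row], fds: Sequence[FD]) -> FDIndex:
    """For each FD, record the known determinant-values -> dependent-value pairs."""
    index: FDIndex = {fd: {} for fd in fds}
    for row in rows:
        for fd in fds:
            lhs, rhs = fd
            key = tuple(row[a] for a in lhs)
            if row[rhs] is not None and None not in key:
                index[fd][key] = row[rhs]
    return index


def value_completeness(row: Row, attr: str, index: FDIndex) -> float:
    """1.0 if the value is present or recoverable through an FD, else 0.0."""
    if row[attr] is not None:
        return 1.0
    for (lhs, rhs), known in index.items():
        if rhs == attr and tuple(row[a] for a in lhs) in known:
            return 1.0  # "false" null: determined by other values in the data
    return 0.0


def tuple_completeness(row: Row, index: FDIndex) -> float:
    """Assumed computing function: mean of the attribute value scores."""
    return sum(value_completeness(row, a, index) for a in row) / len(row)


def relation_completeness(rows: Sequence[Row], index: FDIndex) -> float:
    """Assumed computing function: mean of the tuple scores."""
    return sum(tuple_completeness(r, index) for r in rows) / len(rows)


def reservoir_sample(stream: Iterable[Row], k: int, rng: random.Random) -> List[Row]:
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    reservoir: List[Row] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


rows = [
    {"zip": "10001", "city": "New York", "phone": "555-0100"},
    {"zip": "10001", "city": None, "phone": None},
    {"zip": "94105", "city": "San Francisco", "phone": None},
    {"zip": None, "city": None, "phone": "555-0199"},
]
index = build_fd_index(rows, [(("zip",), "city")])  # assumed FD: zip -> city

print(round(relation_completeness(rows, {}), 3))     # ignoring FDs: 0.583
print(round(relation_completeness(rows, index), 3))  # with zip -> city: 0.667

# Approximate evaluation on a uniform sample drawn in one pass.
sample = reservoir_sample(iter(rows), 2, random.Random(0))
estimate = relation_completeness(sample, index)
```

In this toy relation the second tuple's missing `city` is recoverable from the first tuple through `zip -> city`, so counting it as missing underestimates completeness (7/12 versus 2/3). For simplicity the sketch builds the dependency index from the full relation, which already costs a full pass; how the approximate algorithms in the thesis treat dependencies on a sample is not stated in the abstract.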
Keywords/Search Tags: data quality, data completeness, evaluating data completeness, uniform sampling