Design And Implementation Of The Inconsistent Data Repairing Subsystem In The Data Cleaning System

Posted on: 2014-06-11
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Men
Full Text: PDF
GTID: 2268330422451701
Subject: Computer technology
Abstract/Summary:
With the spread of Internet applications, the volume of generated data has grown exponentially, and inconsistent, incorrect, or uncertain data has become common in data management systems. As the scope of data applications widens, errors, deviations, and inconsistencies can arise at every stage of data processing for a variety of reasons. The problem is becoming increasingly serious and eventually leads to decision-making errors and economic losses that cannot be neglected. Reliable and accurate data is the basis of sound decisions, yet inconsistent data is a common problem in data integration and data exchange, so effective repair of inconsistent data is an urgent requirement for the further development of database technology and its applications.

In practice, inconsistent data is mainly repaired by hand: the data is checked manually or with the help of the underlying application, and the data producer or provider then corrects the inconsistencies. As applications grow more complex, databases grow larger, and demands on the timeliness of data processing rise, this approach shows two disadvantages: it requires a great deal of manpower and time, and it is limited by the operator's experience and familiarity with the domain knowledge.

To address these shortcomings, we designed and implemented an inconsistent data repairing system for massive data. The system uses conditional functional dependencies (CFDs) from data dependency theory. CFDs express data constraints as a set of rules that capture inconsistencies and errors. By detecting the data that violates the CFDs we identify the inconsistent part and then compute a repair, so that the result satisfies data consistency. The deterministic probability is also given.

The system is implemented in two versions: a stand-alone version structured as a main program with subprocedures, and a parallel version based on the Hadoop framework. Finally, we validate that the system is effective at repairing data inconsistency.
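
To make the detection step concrete, here is a minimal Python sketch of finding tuples that violate a single CFD. It is an illustration only and not the thesis's implementation: the function name `find_violations`, the `_` wildcard convention, and the sample rows are all assumptions introduced here. A CFD is modelled as a list of left-hand-side attributes, one right-hand-side attribute, and a pattern that assigns each of those attributes either a constant or a wildcard.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed notation for "any value" in a CFD pattern


def matches(value, pattern_value):
    """A cell matches a pattern entry if the entry is a wildcard or equal."""
    return pattern_value == WILDCARD or value == pattern_value


def find_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows violating the CFD (lhs -> rhs, pattern).

    Hypothetical helper for illustration; rows are dicts keyed by attribute.
    """
    violations = set()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # The CFD only constrains rows whose LHS values match the pattern.
        if not all(matches(row[a], pattern[a]) for a in lhs):
            continue
        if pattern[rhs] != WILDCARD:
            # Constant RHS: a single row can violate the rule on its own.
            if row[rhs] != pattern[rhs]:
                violations.add(i)
        else:
            # Wildcard RHS: rows agreeing on the LHS must agree on the RHS.
            groups[tuple(row[a] for a in lhs)].append(i)
    for members in groups.values():
        if len({rows[i][rhs] for i in members}) > 1:
            violations.update(members)
    return sorted(violations)


# Made-up sample data: within country "44", zip should determine street.
rows = [
    {"country": "44", "zip": "EH4", "street": "Mayfield"},
    {"country": "44", "zip": "EH4", "street": "Crichton"},
    {"country": "01", "zip": "07974", "street": "Mtn Ave"},
    {"country": "01", "zip": "07974", "street": "Main St"},
]
pattern = {"country": "44", "zip": WILDCARD, "street": WILDCARD}
print(find_violations(rows, ["country", "zip"], "street", pattern))
```

With the sample rows, the first two tuples share country and zip but disagree on street, so both are reported. The last two fall outside the pattern (country is not "44") and are ignored. Grouping by the left-hand-side values is also the shape a key-based parallel version could take, although how the thesis partitions the work on Hadoop is not stated in the abstract.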
Keywords/Search Tags: data quality, massive data, data repair, data consistency