
Research On Data Quality Verification Using Data Mining Technology

Posted on: 2012-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liang
Full Text: PDF
GTID: 2218330338467961
Subject: Computer application technology
Abstract/Summary:
Poor data quality has become a key obstacle to sound enterprise decision-making and a bottleneck for information services. How to manage data efficiently and improve its quality, so that data becomes an effective basis for decision-making departments, is therefore a problem of high research value and practical significance. In this context, this dissertation applies solutions suited to the different types of data errors and implements programs to verify the validity of the methods.

First, this dissertation introduces the definition of data quality, its classification, its evaluation indices, and the techniques for improving it. Second, it summarizes the principles and methods of data cleansing. Finally, it gives corresponding solutions for the different error types, focusing on methods for detecting similar duplicate records and abnormal data.

Taking full account of the links within the data, this dissertation detects abnormal data using an approach based on association rules. First, the data in the dataset is converted into a form that meets the conditions for mining association rules. Second, all frequent itemsets in the training set are found, association rules are generated from them, and the rules are stored in a rule base. Finally, the records in the test set are compared against the rules in the rule base to determine whether each record is abnormal. The experiment showed that the method performs well in detecting abnormal data.

This dissertation uses a method based on weighted grouping to detect similar duplicate records. Appropriate weights are assigned to the attributes according to their ability to identify an object, which improves detection accuracy. The large dataset is divided into small, non-intersecting datasets according to key fields, and similar duplicate records are then detected within these small datasets, which reduces the number of comparisons. Field similarity is computed using position coding to solve the problem of matching English abbreviations and Chinese characters. The above steps are repeated with other key fields to overcome the character-sensitivity problem of grouping on a single key field. The experiment showed that this method can detect similar duplicate records quickly and accurately.
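The association-rule procedure for abnormal data can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the attribute names, the toy training data, the thresholds (`min_support`, `min_confidence`), and all function names are assumptions made for illustration, and the frequent itemsets are found with a simple level-wise (Apriori-style) search.

```python
from itertools import combinations

def to_items(record):
    """Convert a record (dict) into a set of (attribute, value) items."""
    return frozenset(record.items())

def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise search for itemsets whose support meets min_support."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= min_support}
    frequent = dict(current)
    size = 1
    while current and size < max_size:
        size += 1
        items = sorted({i for s in current for i in s})
        # Keep only candidates whose (size-1)-subsets are all frequent.
        candidates = [
            frozenset(c) for c in combinations(items, size)
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        ]
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                current[c] = support
        frequent.update(current)
    return frequent

def build_rules(frequent, min_confidence):
    """Build the rule base: (antecedent itemset, single consequent item)."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            if support / frequent[antecedent] >= min_confidence:
                rules.append((antecedent, consequent))
    return rules

def is_abnormal(record, rules):
    """A record is flagged if it matches a rule's antecedent but not its consequent."""
    items = to_items(record)
    for antecedent, (attr, value) in rules:
        if antecedent <= items and record.get(attr) != value:
            return True
    return False

# Hypothetical training set: city and telephone area code always agree.
train = ([{"city": "Beijing", "area_code": "010"}] * 4
         + [{"city": "Shanghai", "area_code": "021"}] * 4)
frequent = frequent_itemsets([to_items(r) for r in train], min_support=0.3)
rules = build_rules(frequent, min_confidence=0.9)

print(is_abnormal({"city": "Beijing", "area_code": "021"}, rules))  # True
print(is_abnormal({"city": "Beijing", "area_code": "010"}, rules))  # False
```

In this sketch a record is only judged against rules whose antecedent it satisfies, so records that match no rule are treated as normal; whether the dissertation handles such records the same way is not stated in the abstract.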
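The weighted-grouping procedure for similar duplicate records can be sketched in the same hedged way. The field names, weights, threshold, key functions, and sample records below are all hypothetical. The abstract does not specify the position-coding similarity, so this sketch substitutes the standard library's `difflib.SequenceMatcher` as a stand-in field similarity.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical weights: fields that identify an object better weigh more.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}

def field_similarity(a, b):
    """Stand-in for position-coding similarity (assumption, not the thesis method)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2, weights):
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items()) / total

def group_by_key(records, key_func):
    """Split record indices into non-intersecting groups by a key field."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[key_func(rec)].append(idx)
    return groups

def find_duplicates(records, key_funcs, weights, threshold):
    """Compare records only within each group; repeat once per key function."""
    pairs = set()
    for key_func in key_funcs:
        for members in group_by_key(records, key_func).values():
            for i, j in combinations(members, 2):
                if record_similarity(records[i], records[j], weights) >= threshold:
                    pairs.add((i, j))
    return pairs

def by_phone_prefix(rec):
    return rec["phone"][:3]

def by_name_initial(rec):
    return rec["name"][0].lower()

records = [
    {"name": "John Smith", "address": "12 Main Street", "phone": "555-0101"},
    {"name": "Jon Smith", "address": "12 Main St", "phone": "555-0101"},
    {"name": "Mary Jones", "address": "8 Oak Avenue", "phone": "555-0199"},
    {"name": "John Smith", "address": "12 Main Street", "phone": "655-0101"},
]

one_pass = find_duplicates(records, [by_phone_prefix], WEIGHTS, 0.85)
two_pass = find_duplicates(records, [by_phone_prefix, by_name_initial], WEIGHTS, 0.85)
print(sorted(one_pass))  # [(0, 1)]
print(sorted(two_pass))  # [(0, 1), (0, 3), (1, 3)]
```

Record 3 has a typo in the first character of its phone number, so grouping on the phone prefix alone never compares it with records 0 and 1. The second pass on a different key recovers those pairs, which is the reason for repeating the steps with another key field.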
Keywords/Search Tags: data quality, abnormal data, association rules, duplicate records