Research On Data Quality Verification Using Data Mining Technology

Posted on:2012-03-25

Degree:Master

Type:Thesis

Country:China

Candidate:Y Liang

GTID:2218330338467961

Subject:Computer application technology

Abstract/Summary:

Poor data quality has become a key obstacle to sound enterprise decision-making and a bottleneck for information services. How to manage data efficiently and improve its quality, so that data can serve as an effective basis for decision-making departments, is therefore a problem of high research value and practical significance. In this context, this dissertation applies solutions suited to different types of data errors and implements them in concrete programs to verify the validity of the methods.

First, the dissertation introduces the definition of data quality, its classification, its evaluation indices, and the technologies for improving it. Second, it summarizes the principles and methods of data cleansing. Finally, it gives corresponding solutions for different error types, with particular attention to methods for detecting duplicate records and abnormal data.

Taking full account of the links within the data, the dissertation detects abnormal data using an approach based on association rules. First, the data in the dataset is converted into a form that meets the conditions for mining association rules. Second, all frequent item sets in the training set are found, association rules are generated from them, and the rules are stored in a rule base. Finally, each record in the test set is compared against the rules in the rule base to determine whether the record is abnormal. The experiment showed that this method performs well at detecting abnormal data.

The dissertation detects similar duplicate records using a method based on weights and grouping. Each attribute is assigned a weight according to its ability to identify the object, which improves detection accuracy. The large data set is divided into small disjoint data sets according to key fields, and similar duplicate records are then detected within these small data sets, which reduces the number of comparisons. Field similarity is computed using position-coding, which solves the problem of matching English abbreviations and Chinese characters. The above steps are repeated with a different key field to overcome sensitivity to individual characters in the key. The experiment showed that this method detects similar duplicate records quickly and accurately.
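The association-rule approach to abnormal data described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation. It assumes records are dictionaries of categorical attributes, mines only rules with a single-item antecedent and consequent, and uses hypothetical names (`to_items`, `mine_rules`, `is_abnormal`) and illustrative thresholds.

```python
from collections import Counter
from itertools import combinations


def to_items(record):
    """Convert a record (dict) into a set of (attribute, value) items."""
    return frozenset(record.items())


def mine_rules(train, min_support=0.3, min_conf=0.9):
    """Mine rules {a} -> {b} from frequent 1- and 2-item sets.

    Simplification (assumption): only item sets of size 1 and 2 are counted.
    """
    n = len(train)
    counts = Counter()
    for record in train:
        items = to_items(record)
        for item in items:
            counts[frozenset([item])] += 1
        for pair in combinations(items, 2):
            counts[frozenset(pair)] += 1

    rules = []  # the "rule base": (antecedent, consequent, confidence)
    for itemset, count in counts.items():
        if len(itemset) != 2 or count / n < min_support:
            continue
        a, b = tuple(itemset)
        for ante, cons in ((a, b), (b, a)):
            conf = count / counts[frozenset([ante])]
            if conf >= min_conf:
                rules.append((ante, cons, conf))
    return rules


def is_abnormal(record, rules):
    """Flag a record that matches a rule's antecedent but contradicts its consequent."""
    items = to_items(record)
    for ante, (cons_attr, cons_val), _conf in rules:
        if ante in items and cons_attr in record and record[cons_attr] != cons_val:
            return True
    return False


# Illustrative training data (made up for this sketch).
train = [{"city": "Paris", "country": "France"}] * 5 \
      + [{"city": "Berlin", "country": "Germany"}] * 5
rules = mine_rules(train)

consistent = {"city": "Paris", "country": "France"}
suspicious = {"city": "Paris", "country": "Germany"}
```

Here `is_abnormal(suspicious, rules)` flags the record because it satisfies the antecedent `city = Paris` while violating the mined consequent `country = France`. A fuller version would mine larger item sets with Apriori or FP-growth and discretize numeric attributes first.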
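The duplicate-detection steps in the abstract (attribute weights, grouping by key field, pairwise comparison inside each group, and a second pass with another key) can be sketched as follows. This is a sketch under stated assumptions, not the dissertation's code. In particular, the position-coding similarity is not specified in the abstract, so Python's `difflib.SequenceMatcher` is swapped in as a stand-in. The weights, threshold, key functions, and sample records are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def field_similarity(a, b):
    """Stand-in field similarity in [0, 1] (NOT the thesis's position-coding)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, weights):
    """Weighted average of field similarities; higher weight = more identifying."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items()) / total


def find_duplicates(records, weights, key_funcs, threshold=0.85):
    """Group records by each key in turn and compare only within groups."""
    pairs = set()
    for key_func in key_funcs:          # one pass per key field
        groups = defaultdict(list)      # disjoint groups for this key
        for idx, rec in enumerate(records):
            groups[key_func(rec)].append(idx)
        for ids in groups.values():
            for i, j in combinations(ids, 2):
                if record_similarity(records[i], records[j], weights) >= threshold:
                    pairs.add((i, j))
    return pairs


# Illustrative records (made up); record 1 has a typo in the first letter of the name.
records = [
    {"name": "John Smith", "city": "Boston", "phone": "555-1234"},
    {"name": "Kohn Smith", "city": "Boston", "phone": "555-1234"},
    {"name": "Mary Jones", "city": "Denver", "phone": "555-9876"},
]
weights = {"name": 0.5, "phone": 0.3, "city": 0.2}
by_initial = lambda r: r["name"][0].lower()
by_city = lambda r: r["city"].lower()

one_pass = find_duplicates(records, weights, [by_initial])
two_pass = find_duplicates(records, weights, [by_initial, by_city])
```

With only the name-initial key, the typo puts records 0 and 1 in different groups and the pair is missed. Adding the second pass keyed on city brings them into the same group, which is the reason the abstract gives for repeating the steps with another key field.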

Keywords/Search Tags:

data quality, abnormal data, association rules, duplicate records

Related items

1	Research Of Data Cleansing Algorithms For Duplicate Records Detection Problem
2	Research On Duplicate Records Identification Model In Deep Web
3	DM ETL Data Quality Management System Design And Implementation
4	Data Mining Of Electronic Medical Records Based On Association Rules-Diabetes And Its Complications As An Example
5	Based On Association Rules Mining Applications Of Electronic Medical Records
6	Data Bank Data Warehouse Build Process Of Cleaning And VIP Clients Of The Excavation
7	Research On Detection Of Approximate Duplicate Records For Massive Data
8	Research And Implementation Of Data Quality Rules Mining And Detection System
9	Association Rules In Data Mining Research And Of Teaching Quality Assessment
10	Research Of Large Amount Of Data In Chinese Commodity Cleaning Method Of The Algorithm Based On The SNM