
DM ETL Data Quality Management System Design and Implementation

Posted on: 2013-08-10    Degree: Master    Type: Thesis
Country: China    Candidate: B B Zhao    Full Text: PDF
GTID: 2248330392457834    Subject: Computer software and theory
Abstract/Summary:
With the development of information technology, enterprises have accumulated large amounts of data and have begun to integrate dispersed business data into data warehouses for decision support. However, during the construction of a data warehouse, the source data suffer from various quality problems, such as missing values, abnormal values, and duplicate records, and data integration itself may introduce new quality problems. These problems ultimately lower the quality of the data in the data warehouse. High-quality decision-making depends on high-quality data, so data quality must be managed and controlled.

To address the data quality problems that arise while building a data warehouse, a data quality management system is designed to monitor and manage the quality of data from multiple sources and to provide cleaning methods for problem data. The system consists of a data quality detection component, a duplicate record detection module, an error data cleaning module, and a quality problem statistics module. The data quality detection component detects various kinds of erroneous data against validation rules. The duplicate record detection module identifies and handles duplicate records; a method for calculating the similarity of Chinese fields based on a semantic edit distance is introduced to identify duplicates, which improves the accuracy of detecting approximately duplicate records that contain Chinese fields. The error data cleaning module cleans dirty data and flows the corrected data back to the sources. The quality problem statistics module summarizes the problem data generated by the detection component and presents it in visual charts.

Finally, the system is tested with two groups of experiments: one group tests the system's process for detecting and cleaning erroneous data, and the other is a comparative experiment on duplicate record detection using different field matching methods. The results show that the system is correct and effective, and the comparative experiment shows that the new method for detecting duplicate records containing Chinese fields achieves better accuracy.
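The duplicate record detection described above rests on field-level similarity. As a rough illustration only, the Python sketch below computes a plain edit-distance (Levenshtein) similarity over record fields and flags approximate duplicates with a weighted score; the semantic-based edit distance for Chinese fields proposed in the thesis is not reproduced here, and the function names, field weights, and the 0.85 threshold are illustrative assumptions rather than the system's actual implementation.

```python
# Illustrative sketch of edit-distance-based duplicate record detection.
# The thesis's semantic edit distance for Chinese fields is NOT shown;
# weights and threshold below are assumed values for demonstration.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a similarity score in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def is_duplicate(rec1: dict, rec2: dict, weights: dict,
                 threshold: float = 0.85) -> bool:
    """Flag two records as approximate duplicates when the weighted
    average of their field similarities reaches the threshold."""
    total = sum(weights.values())
    score = sum(w * field_similarity(str(rec1.get(f, "")), str(rec2.get(f, "")))
                for f, w in weights.items()) / total
    return score >= threshold


if __name__ == "__main__":
    r1 = {"name": "北京市海淀区中关村大街1号", "phone": "01062551234"}
    r2 = {"name": "北京海淀区中关村大街一号", "phone": "010-62551234"}
    print(is_duplicate(r1, r2, weights={"name": 0.7, "phone": 0.3}))
```

Note that plain edit distance counts the pair "1" and "一" in the example as a full mismatch; presumably this is the kind of near-equivalence in Chinese fields that the semantic-based distance in the thesis is intended to capture, which is why it yields better accuracy on such records.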
Keywords/Search Tags: data quality management, data cleaning, duplicate records