
Research on Some Key Technologies of Data Cleaning

Posted on: 2008-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: C J Bao
Full Text: PDF
GTID: 2178360242488931
Subject: Computer application technology
Abstract/Summary:
With the accelerating advance of global informatization, organizations in all sectors are carrying out informatization projects in order to gain an edge in fierce competition. The data warehouse is one of the most important manifestations of the degree of informatization and is the foundation of decision support. The accuracy of the data in a data warehouse is crucial for its applications and directly influences subsequent decision making. However, because the data in a data warehouse comes from multiple sources, which may be stored on different hardware platforms and under different operating systems, data quality problems are inevitable. The main problems addressed here are: 1. approximately duplicated records; 2. outlier records. The goal of data cleaning is to organize and standardize data, eliminate ambiguities, and improve data quality, so data cleaning is regarded as one of the most important issues in the construction of a data warehouse.

This paper first discusses theories of data quality, then analyzes the necessity of data cleaning and the current state of data cleaning research at home and abroad, and then expounds the relevant theories of data cleaning. The study places particular emphasis on algorithms for detecting approximately duplicated records and outlier records, proposes corresponding improved algorithms, and designs a data cleaning framework model based on these theories. Experiments and practice indicate that the improved algorithms perform well and that the data cleaning framework model has strong practical value. The main contributions are as follows:

1. A method for detecting approximately duplicated database records based on rank grouping is presented. Each attribute is assigned a weight using a rank-based weighting method; following the idea of grouping, a chosen key field (or several words of that field) is used to divide a large data set into many disjoint small data sets, and approximately duplicated records are then detected and eliminated within each small data set; these steps are repeated with other key fields or words of those fields. Experiments show that the algorithm achieves both good detection precision and better time efficiency. (A sketch of the grouping-and-matching idea is given after this summary.)

2. An outlier detection algorithm based on weighted fast clustering is presented. Each attribute is first assigned a weight reflecting its contribution to classification; a comparatively good initial partition is chosen according to the attribute weights, the best partition is obtained after several iterations, and outliers are finally identified by applying certain rules. Experiments demonstrate that this technique detects outlier data effectively. (A sketch of the weighted-clustering idea is also given below.)

3. An extensible and interactive data cleaning system is designed. A data cleaning framework model is built, corresponding detection algorithms and cleaning strategies are proposed for different types of duplicated records and outlier records, and assessment indicators are provided. The system features extensibility, interactivity, and generality, and was successfully applied to data cleaning in a population information system.
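The abstract does not include code; the following is a minimal illustrative sketch of contribution 1 under stated assumptions. The field names, rank-based weights, key-field prefix length, and similarity threshold are all hypothetical, and string similarity is approximated here with Python's difflib rather than the thesis's own measure.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical rank-based weights: more identifying attributes get larger weights.
FIELD_WEIGHTS = {"name": 0.5, "address": 0.3, "birth_date": 0.2}

def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values, in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of per-field similarities."""
    return sum(w * field_similarity(str(r1[f]), str(r2[f]))
               for f, w in FIELD_WEIGHTS.items())

def detect_duplicates(records, key_field="name", prefix_len=3, threshold=0.85):
    """Partition records into disjoint groups by a prefix of the key field,
    then compare only records that fall into the same small group."""
    groups = {}
    for rec in records:
        key = str(rec[key_field]).strip().lower()[:prefix_len]
        groups.setdefault(key, []).append(rec)

    candidate_pairs = []
    for group in groups.values():
        for r1, r2 in combinations(group, 2):
            if record_similarity(r1, r2) >= threshold:
                candidate_pairs.append((r1, r2))
    return candidate_pairs
```

Repeating the grouping pass with a different key field, as the abstract describes, would also catch duplicate pairs whose values in the first key field differ too much to land in the same group.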
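For contribution 2, the following is likewise only a sketch of the general idea: attributes are scaled by weights, a k-means-style partition is refined over several iterations, and points that lie unusually far from their cluster centre are flagged. The number of clusters, the attribute weights, and the "mean plus z standard deviations" outlier rule are assumptions for demonstration, not the thesis's exact algorithm.

```python
import numpy as np

def weighted_kmeans(X, weights, k=3, iters=20, seed=0):
    """k-means in a feature space scaled by per-attribute weights."""
    rng = np.random.default_rng(seed)
    Xw = X * weights                      # emphasise attributes with larger weights
    centers = Xw[rng.choice(len(Xw), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centre, then recompute centres
        d = np.linalg.norm(Xw[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Xw[labels == j].mean(axis=0)
    return labels, centers, Xw

def detect_outliers(X, weights, k=3, z=3.0):
    """Flag points whose distance to their cluster centre exceeds
    the cluster's mean distance plus z standard deviations."""
    labels, centers, Xw = weighted_kmeans(X, weights, k)
    dist = np.linalg.norm(Xw - centers[labels], axis=1)
    outliers = np.zeros(len(X), dtype=bool)
    for j in range(k):
        mask = labels == j
        mu, sigma = dist[mask].mean(), dist[mask].std()
        outliers[mask] = dist[mask] > mu + z * sigma
    return outliers
```

Weighting the features before clustering is one simple way to let an attribute's "sort devotion degree" influence both the partition and the outlier decision; the choice of initial centres and of the flagging rule would be tuned in practice.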
Keywords/Search Tags: data cleaning, data warehouse, rank grouping, approximately duplicated records, weighted fast clustering, outlier data, extensibility, interaction