
Research and Application of Cleaning Approaches for Duplicated Records and Incomplete Data

Posted on: 2010-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Lu
Full Text: PDF
GTID: 2178360302466537
Subject: Computer application technology
Abstract/Summary:
With the development of the information industry, enterprises are accumulating more and more data. Behind this explosively growing data lies important information that is crucial for making proper, scientific decisions and for improving competitiveness. The data warehouse was born to meet the needs of decision analysis. During data warehouse construction, however, the data may for various reasons contain duplicated, incomplete, and outlier records; in other words, the data has quality problems. High-quality data is a precondition of decision support, so data cleaning is necessary to enhance data quality.

This paper first discusses background knowledge of data preprocessing and analyzes the necessity of data cleaning and the state of data cleaning research at home and abroad. It then introduces theories of data quality and data cleaning, covering the definition, principles, basic process, and main techniques of data cleaning. The emphasis is on an in-depth study of approximately duplicated record detection and incomplete data cleaning, with improvements to the related algorithms; a data cleaning prototype system is also designed on this theoretical basis. The work in this paper is as follows:

To clean approximately duplicated records, this paper presents a detection approach based on clustering by the inner code sequence value. The method first chooses a key field (or some leading characters of it) and, according to the inner code sequence value of the characters, partitions the large dataset into many small datasets by clustering. Each attribute is then assigned a weight using the rank-based weighting method. Finally, approximately duplicated records are detected and eliminated within each small dataset. To avoid missing records caused by choosing an improper key field, multiple detection passes over different key fields can be adopted.
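The detection procedure described above can be sketched in Python. This is a minimal illustration, not the thesis's implementation: the bucketing constant, field names, weights, threshold, and the character-overlap similarity function are all assumptions made for the example; the thesis's rank-based weighting is represented here by a precomputed weight dictionary.

```python
def inner_code_value(key, n_chars=3):
    """Map the first n_chars of the key field to a number using the
    characters' code points (the 'inner code sequence value')."""
    prefix = key[:n_chars].lower()
    return sum(ord(c) * 256 ** (n_chars - 1 - i) for i, c in enumerate(prefix))

def cluster_by_inner_code(records, key_field, bucket_size=50000):
    """Partition a large record set into small buckets so that pairwise
    comparison only happens inside each bucket."""
    buckets = {}
    for rec in records:
        b = inner_code_value(rec[key_field]) // bucket_size
        buckets.setdefault(b, []).append(rec)
    return buckets

def field_similarity(a, b):
    """Illustrative similarity: fraction of position-wise matching characters."""
    if not a and not b:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def record_similarity(r1, r2, weights):
    """Weighted sum of per-field similarities (weights assumed precomputed,
    e.g. by rank-based weighting)."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())

def detect_duplicates(records, key_field, weights, threshold=0.85):
    """Detect approximately duplicated record pairs within each small bucket."""
    dupes = []
    for bucket in cluster_by_inner_code(records, key_field).values():
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                if record_similarity(bucket[i], bucket[j], weights) >= threshold:
                    dupes.append((bucket[i], bucket[j]))
    return dupes
```

Because comparisons happen only within a bucket, the quadratic pairwise cost applies to each small dataset rather than the whole collection, which is the source of the method's time-efficiency claim; running the procedure again with a different key field corresponds to the multi-pass detection mentioned above.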
Experimental results show that the proposed method achieves good detection precision and time efficiency.

To clean incomplete data, an approach based on WaveCluster and weighted 1-Nearest Neighbor (1-NN) is proposed. The dataset is first divided into a complete record set and an incomplete record set, and the complete record set is clustered by WaveCluster into subclasses. For each incomplete record, its usability is judged first; the weighted 1-NN method then finds the nearest subclass in the complete record set and fills the record's missing attribute values from it. Experiments demonstrate that the proposed method is appropriate and effective for treating incomplete data.

On the basis of analyzing and studying many data cleaning frameworks, a data cleaning prototype system is designed with an open algorithm library, rule library, and assessment library. It contains plenty of cleaning algorithms and cleaning rules, and provides a wide range of quality assessment methods. Analysis of the main functions of each module in the system architecture, and of its application, shows that the system has good extensibility, flexibility, and interactivity.
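The fill-in step of the incomplete-data approach can be sketched as follows. This is a simplified illustration under stated assumptions: the WaveCluster subclass search is omitted, so the weighted 1-NN search runs directly over the whole complete record set, and the attribute weights and records are invented for the example.

```python
import math

def split_records(records):
    """Divide the dataset into a complete record set and an incomplete one
    (missing values are represented as None)."""
    complete = [r for r in records if all(v is not None for v in r)]
    incomplete = [r for r in records if any(v is None for v in r)]
    return complete, incomplete

def weighted_distance(r, s, weights):
    """Weighted Euclidean distance computed only over the attributes
    that are present in the incomplete record r."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for a, b, w in zip(r, s, weights) if a is not None))

def impute(records, weights):
    """Fill each incomplete record's missing attributes from its
    weighted nearest neighbor in the complete record set."""
    complete, incomplete = split_records(records)
    filled = []
    for r in incomplete:
        nn = min(complete, key=lambda s: weighted_distance(r, s, weights))
        filled.append(tuple(a if a is not None else b for a, b in zip(r, nn)))
    return complete + filled
```

In the full method, the nearest-neighbor search would be restricted to the nearest WaveCluster subclass rather than scanning all complete records, which keeps the fill-in step efficient on large datasets.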
Keywords/Search Tags: data cleaning, data quality, approximately duplicated records, inner code sequence value, incomplete data, cleaning system, extensibility