
Research and Application of Cleaning Approaches for Duplicated Records and Incomplete Data

Posted on: 2010-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Lu
Full Text: PDF
GTID: 2178360302466537
Subject: Computer application technology
Abstract/Summary:
With the development of the information industry, enterprises are accumulating more and more data. Behind this explosively growing data lies important information that is crucial for making proper, scientific decisions and for improving competitiveness. The data warehouse was born to meet the needs of decision analysis. During data warehouse construction, however, the data may for various reasons contain duplicated, incomplete, and outlier records; in other words, the data has quality problems. High-quality data is a precondition of decision support, so data cleaning is necessary to enhance data quality.

This paper first discusses background knowledge of data preprocessing and analyzes the necessity of data cleaning and the state of data cleaning research at home and abroad. It then introduces theories of data quality and data cleaning, covering the definition, principles, basic process, and main techniques of data cleaning. The emphasis is on an in-depth study of approximately duplicated record detection and incomplete data cleaning, with improvements to the related algorithms; a data cleaning prototype system is also designed on this theoretical basis. The work in this paper is as follows:

To clean approximately duplicated records, this paper presents a detection approach based on clustering by the inner code sequence value. The method first chooses a key field (or some leading characters of it) and, according to the inner code sequence value of the characters, partitions the large dataset into many small datasets by clustering. Each attribute is then assigned a weight using the rank-based weighting method. Finally, approximately duplicated records are detected and eliminated within each small dataset. To avoid missing records caused by choosing an improper key field, multiple detection passes over different key fields can be adopted.
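The detection procedure described above can be sketched in Python. This is a minimal illustration, not the thesis's implementation: the bucketing constant, field names, weights, threshold, and the character-overlap similarity function are all assumptions made for the example; the thesis's rank-based weighting is represented here by a precomputed weight dictionary.

```python
def inner_code_value(key, n_chars=3):
    """Map the first n_chars of the key field to a number using the
    characters' code points (the 'inner code sequence value')."""
    prefix = key[:n_chars].lower()
    return sum(ord(c) * 256 ** (n_chars - 1 - i) for i, c in enumerate(prefix))

def cluster_by_inner_code(records, key_field, bucket_size=50000):
    """Partition a large record set into small buckets so that pairwise
    comparison only happens inside each bucket."""
    buckets = {}
    for rec in records:
        b = inner_code_value(rec[key_field]) // bucket_size
        buckets.setdefault(b, []).append(rec)
    return buckets

def field_similarity(a, b):
    """Illustrative similarity: fraction of position-wise matching characters."""
    if not a and not b:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def record_similarity(r1, r2, weights):
    """Weighted sum of per-field similarities (weights assumed precomputed,
    e.g. by rank-based weighting)."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())

def detect_duplicates(records, key_field, weights, threshold=0.85):
    """Detect approximately duplicated record pairs within each small bucket."""
    dupes = []
    for bucket in cluster_by_inner_code(records, key_field).values():
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                if record_similarity(bucket[i], bucket[j], weights) >= threshold:
                    dupes.append((bucket[i], bucket[j]))
    return dupes
```

Because comparisons happen only within a bucket, the quadratic pairwise cost applies to each small dataset rather than the whole collection, which is the source of the method's time-efficiency claim; running the procedure again with a different key field corresponds to the multi-pass detection mentioned above.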
Experimental results show that the proposed method achieves good detection precision and time efficiency.

To clean incomplete data, an approach based on WaveCluster and weighted 1-Nearest Neighbor (1-NN) is proposed. The dataset is first divided into a complete record set and an incomplete record set, and the complete record set is clustered by WaveCluster into subclasses. For each incomplete record, its usability is judged first; the weighted 1-NN method then finds the nearest subclass in the complete record set and fills the record's missing attribute values from it. Experiments demonstrate that the proposed method is appropriate and effective for treating incomplete data.

On the basis of analyzing and studying many data cleaning frameworks, a data cleaning prototype system is designed with an open algorithm library, rule library, and assessment library. It contains plenty of cleaning algorithms and cleaning rules, and provides a wide range of quality assessment methods. Analysis of the main functions of each module in the system architecture, and of its application, shows that the system has good extensibility, flexibility, and interactivity.
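The fill-in step of the incomplete-data approach can be sketched as follows. This is a simplified illustration under stated assumptions: the WaveCluster subclass search is omitted, so the weighted 1-NN search runs directly over the whole complete record set, and the attribute weights and records are invented for the example.

```python
import math

def split_records(records):
    """Divide the dataset into a complete record set and an incomplete one
    (missing values are represented as None)."""
    complete = [r for r in records if all(v is not None for v in r)]
    incomplete = [r for r in records if any(v is None for v in r)]
    return complete, incomplete

def weighted_distance(r, s, weights):
    """Weighted Euclidean distance computed only over the attributes
    that are present in the incomplete record r."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for a, b, w in zip(r, s, weights) if a is not None))

def impute(records, weights):
    """Fill each incomplete record's missing attributes from its
    weighted nearest neighbor in the complete record set."""
    complete, incomplete = split_records(records)
    filled = []
    for r in incomplete:
        nn = min(complete, key=lambda s: weighted_distance(r, s, weights))
        filled.append(tuple(a if a is not None else b for a, b in zip(r, nn)))
    return complete + filled
```

In the full method, the nearest-neighbor search would be restricted to the nearest WaveCluster subclass rather than scanning all complete records, which keeps the fill-in step efficient on large datasets.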
Keywords/Search Tags: data cleaning, data quality, approximately duplicated records, inner code sequence value, incomplete data, cleaning system, extensibility