Research And Implementation Of Data Cleaning System Based On Pre-Processing Techniques

Posted on:2008-10-30

Degree:Master

Type:Thesis

Country:China

Candidate:J X Li

Full Text:PDF

GTID:2178360215997647

Subject:Computer application technology

Abstract/Summary:

With the rapid development of information technology, enterprises have accumulated large amounts of data over long-term operation. Using these data to guide decision analysis is key to staying competitive and maximizing benefit. A decision support system is a data-driven application, so data with quality problems lead to faulty decision analysis. Correcting the errors in dirty data therefore plays a central role in reducing the risk of wrong decisions, which gives the study of data cleaning both theoretical and practical value. In this dissertation, data cleaning is studied and a data cleaning system is designed.

This thesis researches and analyses the technologies of data cleaning pre-processing and of data cleaning itself. The study of data pre-processing focuses on outlier detection and abbreviation discovery. For data cleaning, the key technology studied is the cleaning of approximately duplicated records.

The outlier detection algorithm uses a distance-based method with local pruning, and a vertical data structure, the Peano Count Tree (P-tree), is adopted to make outlier detection more efficient. Two modifications are made to the algorithm. An abbreviation discovery algorithm based on dynamic programming is also studied; it can detect abbreviations in Chinese as well as in English.

To clean approximately duplicated records, a comprehensive cleaning method is given. A Multi-Pass Sorted-Neighborhood algorithm is used to sort the dataset. The thesis then analyses methods for calculating the similarity between fields and the similarity between records, and discusses how approximately duplicated records are combined. Modifications are made to each algorithm. Based on this research on data pre-processing and data cleaning, a data cleaning system is proposed.

Finally, experimental results show that the data cleaning system and its algorithms clean data effectively and efficiently.
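
The multi-pass sorted-neighborhood idea mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration of the textbook technique, not the thesis's modified algorithm: the function names, the window size, the 0.85 threshold, and the equal field weights are all assumptions, and Python's `difflib.SequenceMatcher` stands in for whatever field-similarity measure the thesis actually uses.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values in [0, 1] (stand-in measure, an assumption)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items()) / total


def sorted_neighborhood_pass(records, key, window, weights, threshold):
    """One pass: sort by `key`, then compare each record with the
    window - 1 records that precede it in the sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def multi_pass_duplicates(records, keys, window=3, weights=None, threshold=0.85):
    """Run one pass per sort key and merge the matched pairs transitively."""
    if not records:
        return []
    # Hypothetical default: every field of the first record, equally weighted.
    weights = weights or {f: 1.0 for f in records[0]}
    parent = list(range(len(records)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for key in keys:
        for i, j in sorted_neighborhood_pass(records, key, window, weights, threshold):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    # Return only groups that contain more than one record index.
    return sorted(g for g in groups.values() if len(g) > 1)
```

Each pass only compares records that fall close together under one sort key, so a single key can miss duplicates whose key fields contain errors; running several passes with different keys and merging the results transitively is what lets the method recover them.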

Keywords/Search Tags:

data cleaning, data pre-processing, outlier detection, abbreviation discovery, approximately duplicated record identification

Related items

1	Some Main Technology's Research Of Data Cleaning
2	Research On Data Cleaning Of Approximately Duplicated Records
3	The Research And Application Of Duplicated Records And Incomplete Data's Cleaning Approach
4	Research Of Data Cleaning Method Based On Data Warehouse
5	Research On Multi-source Heterogeneous Large Data Cleaning Technology Based On Machine Learning
6	Research On Detection Of Approximate Duplicate Records For Massive Data
7	Research On The Method Of Approximately Duplicated Records Detection For Text Data In Big Data Environment
8	Data Cleaning Algorithm And Applications
9	Research On Data Cleaning Method Based On Optimal Feature Selection
10	Similar Repetitive Record Detection Method In Uncertainty Database