Research And Implementation Of Data Cleaning System Based On Pre-Processing Techniques

Posted on: 2008-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: J X Li
GTID: 2178360215997647
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, enterprises have accumulated large amounts of data over years of operation. How to use this data to guide decision analysis is key to their competitiveness and to maximizing their benefits.

A decision support system is a data-driven application, but data with many quality problems leads to incorrect decision analysis. Correcting the errors in dirty data therefore plays a central role in reducing the risk of wrong decisions, and the study of data cleaning has both theoretical and practical value. In this dissertation, data cleaning is studied and a data cleaning system is designed.

The technologies of data cleaning pre-processing and data cleaning are researched and analysed in this thesis. The study of data pre-processing focuses on outlier detection and abbreviation discovery. The cleaning of approximately duplicated records, the key technology of data cleaning, is also studied. The outlier detection algorithm uses a distance-based method with local pruning, and a vertical data structure (Peano Count Tree, P-tree) is adopted to make outlier detection more efficient. We make two modifications to this algorithm. An abbreviation discovery algorithm based on dynamic programming is studied; it can detect abbreviations in Chinese as well as in English. To clean approximately duplicated records, a combined cleaning method is given: a Multi-Pass Sorted-Neighborhood algorithm is used to sort the dataset, and the thesis then analyses the methods for calculating the similarity between fields and between records. The merging of approximately duplicated records is also discussed. We make modifications to each algorithm.

Based on the research on data pre-processing and data cleaning, a data cleaning system is proposed.
Finally, experimental results show that the data cleaning system and its algorithms clean data effectively and efficiently.
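The duplicate-cleaning pipeline named in the abstract (multi-pass sorted neighborhood, then field and record similarity) can be illustrated with a minimal sketch. This is not the thesis's modified algorithm: the function names, the window size, the threshold, the field weights, and the use of Python's `difflib.SequenceMatcher` as the field-similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    # Assumed stand-in for a field-similarity measure: difflib's ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, weights):
    # Weighted average of per-field similarities (weights are hypothetical).
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items()) / total


def multi_pass_snm(records, key_funcs, weights, window=3, threshold=0.85):
    # One pass per sort key: sort the records, slide a fixed-size window over
    # the sorted order, and compare only records that share a window.
    pairs = set()
    for key in key_funcs:
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            for j in order[pos + 1 : pos + window]:
                if record_similarity(records[i], records[j], weights) >= threshold:
                    pairs.add((min(i, j), max(i, j)))
    # The union over all passes gives the candidate duplicate pairs.
    return pairs


if __name__ == "__main__":
    records = [
        {"name": "John Smith", "city": "Nanjing"},
        {"name": "Jon Smith", "city": "Nanjing"},
        {"name": "Alice Wong", "city": "Beijing"},
    ]
    keys = [lambda r: r["name"], lambda r: r["city"]]
    print(multi_pass_snm(records, keys, {"name": 2, "city": 1}))
```

Running several passes with different sort keys lets records that a single key would place far apart still meet inside some window; the resulting pairs would then feed a merging step such as the one the thesis discusses.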
Keywords/Search Tags:data cleaning, data pre-processing, outlier detection, abbreviation discovery, approximately duplicated identification