Study Of Data Cleaning Algorithms Based On Data Warehouse

Posted on: 2007-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: H N Yang
Full Text: PDF
GTID: 2178360215495254
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
The rapid development of information technology has made organizational managers increasingly dependent on data when making decisions. On the foundation of the database, the data warehouse emerged to support decision analysis. However, when data from different sources are loaded into a data warehouse, many data quality problems can arise and lead to wrong analysis results. To improve data quality, a data cleaning process is strongly needed, and data cleaning is becoming an important topic in the fields of data warehousing and data mining.

This paper presents data cleaning knowledge in detail, introduces the relevant concepts and the current state of research at home and abroad, and summarizes the important theories, methods, evaluation criteria, and basic workflow of data cleaning. In particular, it focuses on techniques and algorithms for cleaning approximately duplicate records, and an improved algorithm is proposed.

For approximate duplicate record cleaning, the basic data cleaning knowledge and process are presented, together with a detailed analysis of data cleaning algorithms. The main work is as follows. In preprocessing, based on the idea that a field should receive a larger weight the better it clusters identical records and separates different ones, this paper gives a method for determining field weights. On the subject of field matching, this paper analyzes several common algorithms, such as Levenshtein distance, Smith-Waterman distance, Jaro-Winkler distance, and TI similarity; a sketch of a typical field-matching measure follows the abstract. For clustering approximately duplicate records at the database level, several methods based on the "sort-merge" idea, such as the SNM algorithm, the MPN algorithm, and the priority queue algorithm, are also discussed in detail. An improved SNM algorithm is given and related experiments are conducted. The experimental results show that the improved SNM outperforms the traditional algorithm in time complexity at the same recall rate. In addition, this paper analyzes the efficiency and performance of the priority queue algorithm and the canopy clustering algorithm in the duplicate record detection process.
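As a concrete illustration of the field-matching step discussed above, the following sketch computes the classic Levenshtein edit distance between two field values and converts it into a similarity score in [0, 1]. The function names and the example threshold are illustrative assumptions, not the thesis's exact formulation or parameters.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Map edit distance to a similarity score in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Two field values are typically treated as matching when their similarity
# exceeds a threshold (0.8 here is an illustrative value, not the thesis's).
print(field_similarity("Jonathan Smith", "Jonathon Smith"))  # ~0.93
```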
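The basic sorted-neighborhood method (SNM) mentioned in the abstract can be sketched roughly as follows: records are sorted by a generated key, and a fixed-size window slides over the sorted list so that each record is compared only with its neighbors in the window. The key function, window size, record structure, and matcher below are assumptions for illustration; the thesis's improved SNM is not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def snm_detect(records: List[Record],
               key_fn: Callable[[Record], str],
               is_match: Callable[[Record, Record], bool],
               window: int = 5) -> List[Tuple[int, int]]:
    """Basic sorted-neighborhood method: sort records by a key, then
    compare each record only with its predecessors inside a sliding window."""
    order = sorted(range(len(records)), key=lambda i: key_fn(records[i]))
    pairs: List[Tuple[int, int]] = []
    for pos, i in enumerate(order):
        # Compare with the previous (window - 1) records in sort order.
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.append((min(i, j), max(i, j)))
    return pairs


# Illustrative usage with a hypothetical sort key and field matcher.
people = [
    {"name": "Li Hua", "city": "Beijing"},
    {"name": "Li Hua", "city": "Bei jing"},
    {"name": "Wang Wei", "city": "Shanghai"},
]
dups = snm_detect(
    people,
    key_fn=lambda r: (r["name"] + r["city"]).replace(" ", "").lower(),
    is_match=lambda a, b: a["name"] == b["name"],
    window=3,
)
print(dups)  # [(0, 1)] -- records 0 and 1 share a window and match
```

Because only records inside the window are compared, the number of comparisons grows roughly linearly with the number of records rather than quadratically, which is the efficiency motivation behind SNM and its variants discussed in the thesis.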
Keywords/Search Tags: data cleaning, approximate duplicate records, field matching, record matching