Study Of Data Cleaning Algorithms Based On Data Warehouse

Posted on: 2007-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: H N Yang
Full Text: PDF
GTID: 2178360215495254
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
The rapid development of information technology has made organizational managers increasingly dependent on data when making decisions. On the foundation of the database, the data warehouse emerged to support decision analysis. However, when data from different sources are loaded into a data warehouse, many data quality problems can arise and lead to wrong analysis results. To improve data quality, a data cleaning process is strongly needed, and data cleaning is becoming an important topic in the fields of data warehousing and data mining.

This paper presents data cleaning knowledge in detail, introduces the relevant concepts and the current state of research at home and abroad, and summarizes the important theories, methods, evaluation criteria, and basic workflow of data cleaning. In particular, it focuses on techniques and algorithms for cleaning approximately duplicate records, and an improved algorithm is proposed.

For approximate duplicate record cleaning, the basic data cleaning knowledge and process are presented, together with a detailed analysis of data cleaning algorithms. The main work is as follows. In preprocessing, based on the idea that a field should receive a larger weight the better it clusters identical records and separates different ones, this paper gives a method for determining field weights. On the subject of field matching, this paper analyzes several common algorithms, such as Levenshtein distance, Smith-Waterman distance, Jaro-Winkler distance, and TI similarity; a sketch of a typical field-matching measure follows the abstract. For clustering approximately duplicate records at the database level, several methods based on the "sort-merge" idea, such as the SNM algorithm, the MPN algorithm, and the priority queue algorithm, are also discussed in detail. An improved SNM algorithm is given and related experiments are conducted. The experimental results show that the improved SNM outperforms the traditional algorithm in time complexity at the same recall rate. In addition, this paper analyzes the efficiency and performance of the priority queue algorithm and the canopy clustering algorithm in the duplicate record detection process.
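As a concrete illustration of the field-matching step discussed above, the following sketch computes the classic Levenshtein edit distance between two field values and converts it into a similarity score in [0, 1]. The function names and the example threshold are illustrative assumptions, not the thesis's exact formulation or parameters.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Map edit distance to a similarity score in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Two field values are typically treated as matching when their similarity
# exceeds a threshold (0.8 here is an illustrative value, not the thesis's).
print(field_similarity("Jonathan Smith", "Jonathon Smith"))  # ~0.93
```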
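The basic sorted-neighborhood method (SNM) mentioned in the abstract can be sketched roughly as follows: records are sorted by a generated key, and a fixed-size window slides over the sorted list so that each record is compared only with its neighbors in the window. The key function, window size, record structure, and matcher below are assumptions for illustration; the thesis's improved SNM is not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def snm_detect(records: List[Record],
               key_fn: Callable[[Record], str],
               is_match: Callable[[Record, Record], bool],
               window: int = 5) -> List[Tuple[int, int]]:
    """Basic sorted-neighborhood method: sort records by a key, then
    compare each record only with its predecessors inside a sliding window."""
    order = sorted(range(len(records)), key=lambda i: key_fn(records[i]))
    pairs: List[Tuple[int, int]] = []
    for pos, i in enumerate(order):
        # Compare with the previous (window - 1) records in sort order.
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.append((min(i, j), max(i, j)))
    return pairs


# Illustrative usage with a hypothetical sort key and field matcher.
people = [
    {"name": "Li Hua", "city": "Beijing"},
    {"name": "Li Hua", "city": "Bei jing"},
    {"name": "Wang Wei", "city": "Shanghai"},
]
dups = snm_detect(
    people,
    key_fn=lambda r: (r["name"] + r["city"]).replace(" ", "").lower(),
    is_match=lambda a, b: a["name"] == b["name"],
    window=3,
)
print(dups)  # [(0, 1)] -- records 0 and 1 share a window and match
```

Because only records inside the window are compared, the number of comparisons grows roughly linearly with the number of records rather than quadratically, which is the efficiency motivation behind SNM and its variants discussed in the thesis.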
Keywords/Search Tags: data cleaning, approximate duplicate records, field matching, record matching