
Towards Data-Mining: Data Cleaning Based On Clustering Techniques

Posted on: 2004-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Tang
Full Text: PDF
GTID: 2168360122960424
Subject: Computer software and theory
Abstract/Summary:
Many companies make extensive use of cheap information from the Internet in pursuit of higher profit, which causes data volumes to grow tremendously. It has therefore become an urgent task to quickly discover interesting patterns in large data sets. Data mining provides many effective algorithms and techniques for this problem. However, these techniques are based on the assumption that the source data are correct, relevant, and consistent. Because real-life data are usually dirty, these techniques still have a long way to go before they reach practical application. Data cleaning is therefore an important task in data mining: it directly affects the quality of the discovered knowledge and the effectiveness of data mining algorithms.

Research on data cleaning can be divided roughly into two parts: ⑴ correcting data errors, and ⑵ integrating multiple data sources to obtain more complete information about real-world entities. Information integration is an important processing step in many research areas. When integrating a large real-world dataset, many key factors must be considered, such as data quality, correctness, consistency, completeness, and reliability. Unfortunately, data input and data acquisition are inherently prone to error. In summary, four aspects cause data anomalies during integration: ⑴ the absence of universal keys across different databases, i.e. the object identity problem, where the same entity has different identifiers; ⑵ the use of different data formats by different organizations, which makes integration difficult; ⑶ the large volume of input data, which makes errors hard to avoid; and ⑷ inconsistent data. After combining multiple sources of information, there are usually cases in which different descriptions refer to the same entity, namely the duplicate record problem, caused by different data schemata, different naming conventions, input errors, and inconsistent abbreviations. For data consistency, the data sources should contain no duplicate records; therefore, duplicate records must be detected and deleted.

This paper discusses the important role of the data cleaning process in many areas and reviews the current domestic and international research on data cleaning. It also points out some problems with current data cleaning techniques and proposes corresponding methods. Experimental results demonstrate the effectiveness and accuracy of our algorithms. Our main work is summarized as follows:

⑴ We analyze the current state of data cleaning research, including its open problems. Current data cleaning systems lack a pre-processing step, which places a heavy burden on the subsequent data cleaning stage. We propose a method that uses an external source file to clean dirty data. It can handle some simple data errors and inconsistent data, avoiding inconsistent name abbreviations, and provides better input data for the data cleaning process. Furthermore, we put forward a new idea for transforming data into different database structures according to given requirements.

⑵ We use the canopy clustering technique, which is well suited to large datasets, to match data records. In addition, we put forward a decision push-down method, which greatly decreases the computation time. Companies need to integrate data from multiple sources when they analyze business data and make decisions.
During the integration process, the same entity must be recognized across different representations in order to obtain more complete information. We have conducted research in this area.

⑶ Aiming at the existing problems in duplicate record detection, we put forward a method that uses the canopy clustering technique to cluster duplicate records. Because a large database usually contains erroneous, inconsistent, and missing data, duplicate records inevitably exist. For consistency, these duplicate records must be detected and deleted. Our method consists of two stages. The first stage uses an invert...
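As a rough illustration of the canopy-based duplicate detection described above, the sketch below first groups records into overlapping canopies with a cheap token-overlap distance, and then runs a more expensive string comparison only within each canopy. The Jaccard token distance, the difflib.SequenceMatcher similarity, the threshold values, and the sample records are illustrative assumptions, not the thesis's actual metrics or data.

```python
# Minimal sketch of canopy-based duplicate record detection (illustrative only).
# Cheap metric: Jaccard distance over word tokens; expensive metric: difflib ratio.
# Thresholds t1 (loose) and t2 (tight), t1 > t2, are assumed example values.

from difflib import SequenceMatcher

def cheap_distance(a, b):
    """Jaccard distance over lowercase word tokens (cheap canopy metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def build_canopies(records, t1=0.8, t2=0.4):
    """Group record indices into overlapping canopies using the cheap metric."""
    canopies, remaining = [], set(range(len(records)))
    while remaining:
        center = remaining.pop()
        canopy = {center}
        for i in list(remaining):
            d = cheap_distance(records[center], records[i])
            if d < t1:
                canopy.add(i)          # loosely similar: joins this canopy
            if d < t2:
                remaining.discard(i)   # tightly similar: never becomes a new center
        canopies.append(canopy)
    return canopies

def find_duplicates(records, threshold=0.6):
    """Run the expensive pairwise comparison only inside each canopy."""
    duplicates = set()
    for canopy in build_canopies(records):
        members = sorted(canopy)
        for x in range(len(members)):
            for y in range(x + 1, len(members)):
                i, j = members[x], members[y]
                sim = SequenceMatcher(None, records[i], records[j]).ratio()
                if sim >= threshold:
                    duplicates.add((i, j))
    return duplicates

if __name__ == "__main__":
    recs = ["ACME Corp., 12 Main St.",
            "Acme Corporation, 12 Main Street",
            "Globex Inc., 5 Oak Ave."]
    print(find_duplicates(recs))   # prints {(0, 1)} with these illustrative thresholds
```

Because the expensive comparison is restricted to pairs that share a canopy, the number of detailed record comparisons grows with canopy size rather than with the full dataset, which is the main reason canopy-style blocking is attractive for large datasets.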
Keywords/Search Tags:data cleaning, data transformation, Canopy clustering technique, merge/purge problem, duplicate records problem