
Towards Data-Mining: Data Cleaning Based On Clustering Techniques

Posted on: 2004-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Tang
Full Text: PDF
GTID: 2168360122960424
Subject: Computer software and theory
Abstract/Summary:
Many companies make extensive use of cheap information from the Internet in pursuit of higher profit, which causes data volumes to grow tremendously. It has therefore become an urgent task to quickly discover interesting patterns in large data sets. Data mining provides many effective algorithms and techniques for this problem. However, these techniques are based on the assumption that the source data are correct, relevant, and consistent. Because real-life data are usually dirty, these techniques still have a long way to go before they reach practical application. Data cleaning is therefore an important task in data mining: it directly affects the quality of the discovered knowledge and the effectiveness of data mining algorithms.

Research on data cleaning can be divided roughly into two parts: ⑴ correcting data errors, and ⑵ integrating multiple data sources to obtain more complete information about real-world entities. Information integration is an important processing step in many research areas. When integrating a large real-world dataset, many key factors must be considered, such as data quality, correctness, consistency, completeness, and reliability. Unfortunately, data input and data acquisition are inherently prone to error. In summary, four aspects cause data anomalies during integration: ⑴ the absence of universal keys across different databases, i.e. the object identity problem, where the same entity has different identifiers; ⑵ the use of different data formats by different organizations, which makes integration difficult; ⑶ the large volume of input data, which makes errors hard to avoid; and ⑷ inconsistent data. After combining multiple sources of information, there are usually cases in which different descriptions refer to the same entity, namely the duplicate record problem, caused by different data schemata, different naming conventions, input errors, and inconsistent abbreviations. For data consistency, the data sources should contain no duplicate records; therefore, duplicate records must be detected and deleted.

This paper discusses the important role of the data cleaning process in many areas and reviews the current domestic and international research on data cleaning. It also points out some problems with current data cleaning techniques and proposes corresponding methods. Experimental results demonstrate the effectiveness and accuracy of our algorithms. Our main work is summarized as follows:

⑴ We analyze the current state of data cleaning research, including its open problems. Current data cleaning systems lack a pre-processing step, which places a heavy burden on the subsequent data cleaning stage. We propose a method that uses an external source file to clean dirty data. It can handle some simple data errors and inconsistent data, avoiding inconsistent name abbreviations, and provides better input data for the data cleaning process. Furthermore, we put forward a new idea for transforming data into different database structures according to given requirements.

⑵ We use the canopy clustering technique, which is well suited to large datasets, to match data records. In addition, we put forward a decision push-down method, which greatly decreases the computation time. Companies need to integrate data from multiple sources when they analyze business data and make decisions.
During the integration process, the same entity must be recognized across different representations in order to obtain more complete information. We have conducted research in this area.

⑶ Aiming at the existing problems in duplicate record detection, we put forward a method that uses the canopy clustering technique to cluster duplicate records. Because a large database usually contains erroneous, inconsistent, and missing data, duplicate records inevitably exist. For consistency, these duplicate records must be detected and deleted. Our method consists of two stages. The first stage uses an invert...
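As a rough illustration of the canopy-based duplicate detection described above, the sketch below first groups records into overlapping canopies with a cheap token-overlap distance, and then runs a more expensive string comparison only within each canopy. The Jaccard token distance, the difflib.SequenceMatcher similarity, the threshold values, and the sample records are illustrative assumptions, not the thesis's actual metrics or data.

```python
# Minimal sketch of canopy-based duplicate record detection (illustrative only).
# Cheap metric: Jaccard distance over word tokens; expensive metric: difflib ratio.
# Thresholds t1 (loose) and t2 (tight), t1 > t2, are assumed example values.

from difflib import SequenceMatcher

def cheap_distance(a, b):
    """Jaccard distance over lowercase word tokens (cheap canopy metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def build_canopies(records, t1=0.8, t2=0.4):
    """Group record indices into overlapping canopies using the cheap metric."""
    canopies, remaining = [], set(range(len(records)))
    while remaining:
        center = remaining.pop()
        canopy = {center}
        for i in list(remaining):
            d = cheap_distance(records[center], records[i])
            if d < t1:
                canopy.add(i)          # loosely similar: joins this canopy
            if d < t2:
                remaining.discard(i)   # tightly similar: never becomes a new center
        canopies.append(canopy)
    return canopies

def find_duplicates(records, threshold=0.6):
    """Run the expensive pairwise comparison only inside each canopy."""
    duplicates = set()
    for canopy in build_canopies(records):
        members = sorted(canopy)
        for x in range(len(members)):
            for y in range(x + 1, len(members)):
                i, j = members[x], members[y]
                sim = SequenceMatcher(None, records[i], records[j]).ratio()
                if sim >= threshold:
                    duplicates.add((i, j))
    return duplicates

if __name__ == "__main__":
    recs = ["ACME Corp., 12 Main St.",
            "Acme Corporation, 12 Main Street",
            "Globex Inc., 5 Oak Ave."]
    print(find_duplicates(recs))   # prints {(0, 1)} with these illustrative thresholds
```

Because the expensive comparison is restricted to pairs that share a canopy, the number of detailed record comparisons grows with canopy size rather than with the full dataset, which is the main reason canopy-style blocking is attractive for large datasets.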
Keywords/Search Tags:data cleaning, data transformation, Canopy clustering technique, merge/purge problem, duplicate records problem