Research On Data Cleaning Algorithm Based On Clustering

Posted on:2016-09-19

Degree:Master

Type:Thesis

Country:China

Candidate:W Zou

Full Text:PDF

GTID:2348330542475740

Subject:Computer technology

Abstract/Summary:

With the arrival of the big data era, the volume of data has been growing rapidly, and clean data is a prerequisite for analyzing it well. Because data is now collected in many different ways, data cleaning is the most important stage that must be carried out. Incorrect measurement methods, extraction constraints, the merging of different data sources, and manual input all produce large amounts of missing data and duplicate records. Traditional methods are not adequate for these two problems. This thesis therefore proposes clustering-based data cleaning algorithms that build on existing clustering algorithms. First, the thesis focuses on two aspects of data cleaning, filling in missing data and deleting duplicate data, and compares the existing algorithms for both. Since data mining techniques are easy to apply to data cleaning, the thesis uses clustering to design its data cleaning algorithms. Second, it outlines the problem of cleaning missing values and discusses several methods for filling them. It studies how to apply the density-based DBSCAN algorithm to filling missing values and finds that DBSCAN is not suitable for filling missing character values. It then proposes an improved algorithm that combines DBSCAN with an association rules algorithm, and shows that the improved algorithm has an advantage in filling accuracy. Finally, the problem of duplicate records is studied in depth. The thesis presents a similarity measure and matching algorithm for existing fields, then puts forward a multi-table record matching algorithm to solve the problem and evaluates it experimentally. The study finds that the clusters formed by DBSCAN are sensitive to its parameter settings, which can lower the accuracy of duplicate record detection. To address this, an algorithm with higher detection accuracy is selected, and experiments verify its feasibility.
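The idea of filling missing values from density-based clusters can be illustrated with a minimal sketch. It is not the algorithm from the thesis: it handles numeric fields only, omits the association rules step, and the helper names (`dbscan`, `fill_missing`), the parameter values, and the sample rows are all assumptions made for illustration. Records are clustered on their observed fields, and a missing value is filled with the mean of that field among records in the same cluster.

```python
from math import dist
from statistics import mean

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: return one cluster label per point (NOISE for outliers)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points))
                     if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(len(points))
                           if dist(points[j], points[k]) <= eps]
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


def fill_missing(rows, target, eps=1.0, min_pts=2):
    """Fill None values in column `target` with the mean of that column
    among records in the same cluster (hypothetical helper)."""
    # Cluster on every column except the one being filled.
    features = [[v for c, v in enumerate(row) if c != target] for row in rows]
    labels = dbscan(features, eps, min_pts)
    # Fallback for noise points or clusters with no known value.
    fallback = mean(r[target] for r in rows if r[target] is not None)
    filled = []
    for row, label in zip(rows, labels):
        row = list(row)
        if row[target] is None:
            peers = [r[target] for r, l in zip(rows, labels)
                     if l == label and l != NOISE and r[target] is not None]
            row[target] = mean(peers) if peers else fallback
        filled.append(row)
    return filled


# Made-up sample data: two dense groups, one missing value in each.
rows = [
    [1.0, 1.0, 10.0],
    [1.2, 0.9, 12.0],
    [0.9, 1.1, None],
    [8.0, 8.0, 50.0],
    [8.1, 7.9, 54.0],
    [7.9, 8.2, None],
]
filled = fill_missing(rows, target=2)
print(filled[2][2], filled[5][2])
```

Because the result depends on `eps` and `min_pts`, a poor choice of either can merge unrelated records or mark most of them as noise, which is the parameter sensitivity the abstract mentions.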

Keywords/Search Tags:

Data cleaning, Missing value, Clustering, Duplicate records

Related items

1	Towards Data-Mining: Data Cleaning Based On Clustering Techniques
2	Research Of Data Cleansing Algorithms For Duplicate Records Detection Problem
3	Research On Duplicate Records Identification Model In Deep Web
4	Research And Implementation Of Data Cleansing Based On Clustering Algorithm
5	Design And Implementation Of Customer Information Cleaning In CRM System
6	Research On Detection Of Approximate Duplicate Records For Massive Data
7	Research On Data Cleaning Method Based On Optimal Feature Selection
8	Data Bank Data Warehouse Build Process Of Cleaning And VIP Clients Of The Excavation
9	Research Of Large Amount Of Data In Chinese Commodity Cleaning Method Of The Algorithm Based On The SNM
10	Study Of Data Cleaning Algorithms Based On Data Warehouse