Research On Data Cleaning Algorithm Based On Clustering

Posted on: 2016-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: W Zou
Full Text: PDF
GTID: 2348330542475740
Subject: Computer technology
Abstract/Summary:
With the arrival of the big data era, data volumes are growing rapidly, and clean data is a prerequisite for sound analysis. Because data is now collected in many different ways, data cleaning is an essential stage of processing. Incorrect measurement methods, extraction constraints, the merging of different data sources, and manual input can all produce large amounts of missing data or duplicate records. Traditional methods are insufficient for these two problems, so this thesis proposes clustering-based data cleaning algorithms built on existing clustering algorithms. First, the thesis focuses on two aspects of data cleaning, filling in missing data and removing duplicate records, and compares existing algorithms for both. Because data mining techniques can readily be applied to data cleaning, clustering is chosen as the basis for the cleaning algorithms. Second, the thesis outlines the problem of cleaning missing values and discusses several methods for filling them. It studies how to apply the density-based DBSCAN algorithm to missing-value filling and finds that DBSCAN is not suitable for filling missing character-type values. An improved algorithm that combines DBSCAN with an association rules algorithm is proposed and shown to have an advantage in filling accuracy. Finally, the problem of duplicate records is studied in depth. A similarity measure for matching records over their existing fields is given, and a multi-table record matching algorithm is proposed and evaluated experimentally. The study finds that cluster formation in DBSCAN is sensitive to its parameter settings, which can lower the accuracy of duplicate record detection. An algorithm with higher detection accuracy is therefore adopted, and experiments verify its feasibility.
Keywords/Search Tags: Data cleaning, Missing value, Clustering, Duplicate records
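
The abstract does not give the thesis's algorithms in detail, so the following is only a minimal sketch of the general idea of density-based missing-value filling, not the author's method. It assumes purely numeric records in which only one column (the hypothetical `target` index) can be missing, clusters the records on the remaining columns with a plain DBSCAN, and fills each gap with the mean of the observed values in the same cluster. The function names `dbscan` and `fill_missing`, the parameters `eps` and `min_pts`, and the sample data are all illustrative assumptions. Character-type values, which the abstract says DBSCAN alone handles poorly, are not covered here.

```python
from math import dist
from statistics import mean

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: return one cluster label per point (NOISE for outliers)."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


def fill_missing(rows, target, eps, min_pts):
    """Fill None in column `target` with the mean of that column in the row's cluster.

    Assumption: every other column is complete and numeric. Rows labelled as
    noise, or in a cluster with no observed value, fall back to the global mean.
    """
    features = [[v for k, v in enumerate(r) if k != target] for r in rows]
    labels = dbscan(features, eps, min_pts)
    global_mean = mean(r[target] for r in rows if r[target] is not None)

    filled = []
    for row, label in zip(rows, labels):
        new_row = list(row)
        if new_row[target] is None:
            peers = [
                r[target]
                for r, l in zip(rows, labels)
                if l == label and label != NOISE and r[target] is not None
            ]
            new_row[target] = mean(peers) if peers else global_mean
        filled.append(new_row)
    return filled


# Two well-separated groups; the third column has one gap in each group.
rows = [
    (1.0, 1.0, 10.0),
    (1.1, 0.9, 12.0),
    (0.9, 1.1, None),
    (8.0, 8.0, 50.0),
    (8.1, 7.9, 54.0),
    (7.9, 8.2, None),
]
filled = fill_missing(rows, target=2, eps=0.5, min_pts=2)
print(filled[2], filled[5])
```

Changing `eps` or `min_pts` changes which records end up in the same cluster and therefore which values are used for filling, which illustrates the parameter sensitivity the abstract mentions for DBSCAN.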