Research And Implementation Of Data Cleansing Based On Clustering Algorithm

Posted on:2009-10-27

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhang

GTID:2178360242486967

Subject:Computer application technology

Abstract/Summary:

With the rapid development of information technology, managers increasingly depend on data for decision analysis. Source data, which often contains many errors, is loaded and refreshed into the data warehouse, so it is essential to clean the source data before it enters the warehouse. This thesis first introduces the basic concepts and significance of data cleaning, along with the current state of research and application at home and abroad. It then summarizes the theories, methods, evaluation standards, and basic workflow of data cleaning. On this basis, the clustering-based DBSCAN algorithm is improved and applied to filling in missing data during cleaning. A corresponding experiment is conducted on a public measurement data set, and a comparison with the results of traditional methods shows improved accuracy. Finally, an improved algorithm and experiments for cleaning duplicate records in the data warehouse are introduced.
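
The abstract does not describe the improved algorithm itself, so the following is only a minimal sketch of the general idea of clustering-based missing-data filling, using plain (unimproved) DBSCAN. It assumes numeric records with missing values marked as None, clusters the complete records, and fills each gap with the corresponding attribute of the nearest cluster centroid. The names fill_missing, eps, and min_pts, as well as the centroid-based filling rule, are illustrative assumptions and are not taken from the thesis.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: one label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels


def fill_missing(records, eps, min_pts):
    """Fill None entries from the centroid of the nearest DBSCAN cluster.

    Assumption for this sketch: clusters are built from complete records
    only, and "nearest" is measured over the attributes that are present.
    """
    complete = [r for r in records if None not in r]
    labels = dbscan(complete, eps, min_pts)

    clusters = {}
    for rec, label in zip(complete, labels):
        if label != NOISE:
            clusters.setdefault(label, []).append(rec)
    centroids = [
        [sum(col) / len(col) for col in zip(*members)]
        for members in clusters.values()
    ]

    filled = []
    for rec in records:
        if None not in rec or not centroids:
            filled.append(list(rec))  # nothing to fill, or no cluster found
            continue
        observed = [k for k, v in enumerate(rec) if v is not None]

        def partial_dist(centroid):
            return math.sqrt(sum((rec[k] - centroid[k]) ** 2 for k in observed))

        best = min(centroids, key=partial_dist)
        filled.append([best[k] if v is None else v for k, v in enumerate(rec)])
    return filled


if __name__ == "__main__":
    records = [
        [1.0, 2.0], [1.1, 2.1], [0.9, 1.9],
        [8.0, 9.0], [8.1, 9.1], [7.9, 8.9],
        [1.0, None], [None, 9.0],
    ]
    for row in fill_missing(records, eps=0.5, min_pts=2):
        print(row)
```

In this toy data the two incomplete records sit close to different clusters on their observed attribute, so each gap is filled from a different centroid. A record with no nearby cluster would still be filled from whichever centroid is least distant, which is a limitation of the sketch rather than a property of the thesis's method.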

Keywords/Search Tags:

data cleaning, clustering algorithm, missing data filling, repeated records elimination

Related items

1	Research On Data Cleaning Algorithm Based On Clustering
2	Towards Data-Mining: Data Cleaning Based On Clustering Techniques
3	Research On Data Cleaning Based On Clustering
4	Some Main Technology's Research Of Data Cleaning
5	The Research And Application Of Duplicated Records And Incomplete Data's Cleaning Approach
6	Data Cleansing In The Detection Of Similar Records
7	Researches On Data Elimination In Forestry WEB Yellow Page Information Integration
8	Research Of Large Amount Of Data In Chinese Commodity Cleaning Method Of The Algorithm Based On The SNM
9	Research On Missing Data Filling Method Based On Shared Knowledge
10	Research On Hybrid Algorithm Of Slope One Based On Predicating And Filling Missing-Data By Iterated