A Study of Distance-Based Clustering and Outlier Detection

Posted on: 2006-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: J P Shang
Full Text: PDF
GTID: 2208360155969229
Subject: Computer software and theory
Abstract/Summary:
Data mining techniques can uncover potentially useful knowledge in vast amounts of data, and they give stored data a significant new role in the information age. With the rapid development of data mining, cluster analysis and outlier detection, both important parts of the field, are widely applied in areas such as pattern recognition, data analysis, image processing, and market research. Research on clustering and outlier detection algorithms has become a highly active topic in data mining.

In this thesis, the author presents the theory of data mining and analyzes clustering and outlier detection algorithms in depth. Based on an analysis of distance-based and density-based clustering algorithms, the author proposes the Distance-Based Clustering and Outlier Detection algorithm (DBCOD), elaborates the idea behind it, describes its functions, and designs program flow charts. The DBCOD algorithm records data points according to a distance threshold, counts the density of every data point during clustering, identifies outliers using a density threshold, and distinguishes valid clusters from outlier clusters by the number of data points each contains. The time complexity of DBCOD is O(n^2) and its space complexity is O(n), where n is the number of objects in the dataset.

DBCOD was implemented in Visual C++ 6.0. For comparison, the k-means and DBSCAN algorithms were also implemented in Visual C++ 6.0. A series of experiments was conducted, covering the correctness of clustering and outlier detection, the precision of clustering and outlier detection, the runtime, the effect of parameters on clustering and outlier detection, the impact of data input order on precision, and the effect of the dataset's density characteristics on the validity of the algorithm.

The experimental results show that DBCOD can both cluster a dataset properly and detect the outliers in it, which effectively addresses the limitation that traditional algorithms either cluster only or find outliers only. Its precision is better than that of k-means, and its efficiency is higher than that of DBSCAN. It works well on datasets of uniform density and on high-density datasets, and it can discover clusters of arbitrary shape. It is not sensitive to noise or outlier data, but it is sensitive to parameter values, and it performs imperfectly when clustering and finding outliers in multi-density datasets.

In summary, DBCOD can find clusters and outliers accurately and validly, and it has advantages in both efficiency and precision.
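
The abstract describes DBCOD only at a high level, so the following Python sketch is one possible reading of those steps and not the thesis's actual procedure. The function name dbcod_sketch and the parameters eps (distance threshold), min_density (density threshold), and min_cluster_size are hypothetical names introduced for illustration. The sketch assumes Euclidean distance and assumes that clusters are grown by linking points that lie within eps of each other. Distances are recomputed on demand, so time stays O(n^2) and extra memory stays O(n), in line with the complexities stated above.

```python
from math import dist


def dbcod_sketch(points, eps, min_density, min_cluster_size):
    """Illustrative only: return one label per point (cluster id, or -1 for outlier)."""
    n = len(points)

    # Density of a point = number of other points within distance eps.
    density = [
        sum(1 for j in range(n) if j != i and dist(points[i], points[j]) <= eps)
        for i in range(n)
    ]

    # Points below the density threshold are flagged as outliers up front.
    labels = [-1 if density[i] < min_density else None for i in range(n)]

    next_id = 0
    for seed in range(n):
        if labels[seed] is not None:
            continue
        # Grow a cluster from the seed by following links of length <= eps.
        labels[seed] = next_id
        members = [seed]
        k = 0  # read position; members doubles as the work queue
        while k < len(members):
            p = members[k]
            k += 1
            for q in range(n):
                if labels[q] is None and dist(points[p], points[q]) <= eps:
                    labels[q] = next_id
                    members.append(q)
        if len(members) < min_cluster_size:
            # Too few points: treat the whole group as an outlier cluster.
            for m in members:
                labels[m] = -1
        else:
            next_id += 1
    return labels


# Two tight groups of four points plus one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbcod_sketch(points, eps=1.5, min_density=2, min_cluster_size=3))
# prints [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Because every label depends on eps and the two thresholds, a sketch like this also makes the parameter sensitivity noted in the abstract easy to observe: changing eps alone can merge clusters or turn whole groups into outliers.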
Keywords/Search Tags: Data Mining, Clustering Algorithms, Outlier Detection, Distance, Density