Font Size: a A A

Research On Improved K-means Clustering Algorithm Based On Density

Posted on:2015-01-04Degree:MasterType:Thesis
Country:ChinaCandidate:Y J JiaFull Text:PDF
GTID:2268330428462828Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the continuous development of information technologyand computer network, there are hundreds of millions ofdistributed information shared by people every day. How toretrieve the required information in these massive,heterogeneous data resources efficiently has become a problem,which is just the job of knowledge discovery (KDD) and datamining (DM).The most basic and important functions of the KDDand DM is text clustering, which has made considerabledevelopment these years.There are mainly five kinds of text clustering algorithmresearched recent years. They are: clustering algorithm based onpartition, such as the most commonly used K-means algorithm;clustering algorithm based on hierarchy; clustering algorithmbased on density, such as DBSCAN algorithm; grid basedclustering algorithm and clustering algorithm based on Model.K-means algorithm is the most classic text clustering algorithm based on partitioning algorithm. The main idea of theK-means algorithm is to randomly select K initial center points,then cluster the data to K crux according to the shortest distanceto the center point; update the cluster center by the mean of thedata in the cluster until the cluster center no longer changes.This algorithm has low time complexity, easy to implementation,and has better scalability in handling large data sets, so it can beused in many areas. But the disadvantages of the K-meansalgorithm are obvious. First, the algorithm is too sensitive to theinitial selected clustering center, the clustering results easilychanges with the initial center. Second, the K value for the lastcluster number constraint the cluster algorithm. In addition, thealgorithm can easily fall into local optimal solution.There are two main improvements to the traditional K-meansalgorithm in this paper. First, proposed an outliner detectionalgorithm based on the distance and statistical methods. Thesorted distance matrix between every data is calculated first,then selects the nearest and shortest distance in the matrix andfinds the maximum difference between nearest neighbordistance which will be used as the radius. Count the density ofdata points within the radius. The data points which density isless than the threshold density are detected by the statistically standard value to determine the future of these data points.Second, this paper proposed an initial center selectionalgorithm with a variable step size. It first calculates the averagedistance of the data points that have the minimum nearestdistances. The average distance established the variable stepsize which will be chosen from small to large. The data pointsthat are fastest between each other and have maximum densitywithin the radius will be the initial center be outputted.At last, this paper added the two algorithms above to thetraditional K-Means algorithm to form a new improved K-Meansclustering algorithm based on density. The new algorithm canavoid the traditional K-Means algorithm ended into local optimalsolution because of randomly initial center selection by theoutlier detection algorithm. And it can improve the quality ofexecution and the clustering efficiency by the maximum densitypoints selection for initial center algorithm. The improvedalgorithm is proved to have better clustering effect and betterclustering quality after testing the experimental data.
Keywords/Search Tags:K-means Algorithm, Variable Radius, Density Threshold, nearest Neighbor Distance
PDF Full Text Request
Related items