Research On Improved K-means Clustering Algorithm Based On Density

Posted on:2015-01-04

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Jia

Full Text:PDF

GTID:2268330428462828

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

With the continuous development of information technologyand computer network, there are hundreds of millions ofdistributed information shared by people every day. How toretrieve the required information in these massive,heterogeneous data resources efficiently has become a problem,which is just the job of knowledge discovery (KDD) and datamining (DM).The most basic and important functions of the KDDand DM is text clustering, which has made considerabledevelopment these years.There are mainly five kinds of text clustering algorithmresearched recent years. They are: clustering algorithm based onpartition, such as the most commonly used K-means algorithm;clustering algorithm based on hierarchy; clustering algorithmbased on density, such as DBSCAN algorithm; grid basedclustering algorithm and clustering algorithm based on Model.K-means algorithm is the most classic text clustering algorithm based on partitioning algorithm. The main idea of theK-means algorithm is to randomly select K initial center points,then cluster the data to K crux according to the shortest distanceto the center point; update the cluster center by the mean of thedata in the cluster until the cluster center no longer changes.This algorithm has low time complexity, easy to implementation,and has better scalability in handling large data sets, so it can beused in many areas. But the disadvantages of the K-meansalgorithm are obvious. First, the algorithm is too sensitive to theinitial selected clustering center, the clustering results easilychanges with the initial center. Second, the K value for the lastcluster number constraint the cluster algorithm. In addition, thealgorithm can easily fall into local optimal solution.There are two main improvements to the traditional K-meansalgorithm in this paper. First, proposed an outliner detectionalgorithm based on the distance and statistical methods. Thesorted distance matrix between every data is calculated first,then selects the nearest and shortest distance in the matrix andfinds the maximum difference between nearest neighbordistance which will be used as the radius. Count the density ofdata points within the radius. The data points which density isless than the threshold density are detected by the statistically standard value to determine the future of these data points.Second, this paper proposed an initial center selectionalgorithm with a variable step size. It first calculates the averagedistance of the data points that have the minimum nearestdistances. The average distance established the variable stepsize which will be chosen from small to large. The data pointsthat are fastest between each other and have maximum densitywithin the radius will be the initial center be outputted.At last, this paper added the two algorithms above to thetraditional K-Means algorithm to form a new improved K-Meansclustering algorithm based on density. The new algorithm canavoid the traditional K-Means algorithm ended into local optimalsolution because of randomly initial center selection by theoutlier detection algorithm. And it can improve the quality ofexecution and the clustering efficiency by the maximum densitypoints selection for initial center algorithm. The improvedalgorithm is proved to have better clustering effect and betterclustering quality after testing the experimental data.

Keywords/Search Tags:

K-means Algorithm, Variable Radius, Density Threshold, nearest Neighbor Distance

PDF Full Text Request

Related items

1	Study On Generalized Nearest Neighbor Pattern Classification
2	The Research And Implementation Of Density-based Clustering Algorithm With Pattern Evaluation Methods
3	Research On Gravitation-Based Density Peaks Clustering Algorithm And Its Application
4	Research On Recommendation Algorithm Based On Improved K-means Algorithm And SVD Algorithm
5	Research On Cluster Center Optimization Of K-means Algorithm
6	Improved K-Means Algorithm Based On K Nearest Neighbor And Density Peak Algorithm
7	Research And Application Of Financial Big Data Based On Density Peak Clustering Of K Near Neighbors
8	Stacked Hashing Quantization Algorithm For Nearest Neighbor Search
9	Research On Spectral Clustering Algorithm Based On Nearest Neighbor Analysis
10	Grid-based Clustering Algorithm Analysis And Research