Font Size: a A A

K-means Clustering Algorithm

Posted on:2011-05-21Degree:MasterType:Thesis
Country:ChinaCandidate:S JiangFull Text:PDF
GTID:2208360308967532Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Clustering is a fundamental problem that frequently arises in a great variety of fields such as pattern recognition, image processing, machine learning and statistics. In general, clustering is defined as the problem of finding homogeneous groups of samples in a given data set. Each of these groups is called a cluster and can be defined as a region in which the density of exemplars is locally higher than in other regions.The simplest form of clustering is partition clustering which aims at partitioning a given data set into disjoint subsets (clusters) so that specific clustering criteria are optimized. The most widely used criterion is the clustering error criterion which for each exemplar computes its squared distance from the corresponding cluster center and then sums these distances for all exemplars in data set. A popular clustering algorithm that minimizes the clustering error is the K-means algorithm. However, the K-means algorithm is a local search procedure. It suffers from some serious drawbacks that influence its performance.K-means clustering is the most popular clustering algorithm based on the partition of data. However, there are some shortcomings of it, such as its requiring a user to give out the number of clusters at first, and its sensitiveness to initial conditions, and its easily fall into the local solution et cetera.The K-means algorithm finds locally optimal solutions with respect to the clustering error. It is a fast iterative algorithm that has been used in many applications. It is a point-based clustering method that starts with the cluster centers initially placed at arbitrary positions and proceeds by moving at each step the cluster centers in order to minimize the clustering error. The main disadvantage of this method lies in its sensitivity to initial positions of cluster centers. Therefore, in order to obtain near optimal solutions using the K-means algorithm several runs must be scheduled differing in the initial positions of the cluster centers.The global K-means algorithm proposed by Likas et al is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N (with N being the size of the data set) runs of the K-means algorithm from suitable initial positions. It avoids the depending on any initial conditions or parameters, and considerably outperforms the K-means algorithms, but it has a heavy computational load. In this paper, a new version of the global K-means algorithm is proposed. We improved the way of creating the next cluster center by introducing some idea of K-medoids clustering algorithm suggested by Park and Jun. Our new algorithm can not only reduce the computational load of the global K-means without affecting the performance of it, but also avoid the influence of the noisy data on clustering result. Our clustering algorithm is tested on some well-known data sets from UCI and on some synthetic data. The experiment results show that our method outperforms the global K-means algorithm.And then, a self-organizing feature map (SOFM) network is researched. The main investigation in this paper is designing a classifier with self-organizing feature map neural network and K-means algorithm. The SOFM network can project multi-dimensional data on a low-dimensional regular grid, so that it can be utilized to explore the potential properties of the large data. The characteristic of SOFM is its high speed, but the accuracy of it is not very good. K-means algorithm clusters nerve cells of SOFM which is a dynamic clustering algorithm for the small data. The characteristic of SOFM is its high accuracy, but the speed of it is not very well. In this paper, the learning process of clustering based on SOFM network was discussed. The problem of neighbor function and learning step's value in the process of weight coefficient self-organizing were analyzed. The specific algorithm to implement clustering was given based on SOFM. This algorithm implemented the functions of determining the number of clusters of K-means automatically.Form the above discussion we can say that in this paper we solved the two problems of K-means. One is its sensitivity to initial seeds and convergence to local solutions not the global solutions. The other is its need to determine the value of K for K-means in advance.
Keywords/Search Tags:Data mining, Clustering, K-means algorithm, Self-organizing feature map network
PDF Full Text Request
Related items