Font Size: a A A

Improvement Of The K-means Algorithm

Posted on:2013-03-25Degree:MasterType:Thesis
Country:ChinaCandidate:Y Q TangFull Text:PDF
GTID:2268330392462642Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Now with the development of computer network and database technology, the informationgrows rapidly. How to find useful information has become a very important subject, and datamining emerged in this context. Data mining is to discover useful information or knowledgefrom a large number of data to help to make scientific decision. Cluster analysis is an importantanalytical tool in data mining and is an unsupervised classification method as well. The purposeof clustering analysis is to divide a data set into several clusters, making the data in the samecluster as similar as possible and the data in different clusters as different as possible. There aremany algorithms on cluster analysis, among which the K-means clustering is the most widelyused and most popular algorithm, with its simple idea and fast convergence. However, theK-means algorithm has many shortcomings: the clustering result is sensitive to the initialcenters; the value of k must be given before clustering. To solve the above problems, wepresent an improved K-means algorithm and a k-value learning algorithm in this paper.Improved K-means clustering algorithm is to solve the problem that the clustering results ofthe traditional K-means algorithm is sensitive to initial cluster centers and to improve theperformance of the algorithm at the same time. The improved algorithm is mainly used when thenumber of clusters is known, which can be run with no parameter but the value of k, inputted byuser. In the algorithm, first, we calculate the distance from origin to each data point, and thensort the data point’s according to their distances. At last, divide ding the sorted data points into kequal sets. In each set, we take the middle points as the initial centers. At last, we provide anefficient way of assigning the data points to suitable clusters with reduced time complexity.K-value learning algorithm is to solve the problem that the K-means algorithm cannotdetermine the value of k, which used genetic algorithm. Genetic algorithm is a kind of randomsearch method in which the process of biological evolution is simulated. According to thefitness function and through natural selection, cross restructuring, variation and other geneticoperations, it constantly updates the population to search the optimal solution. In the algorithm,we used binary coding, make each individual represents a k value. According to the fitnessfunction and constantly through genetic operation iteration we can find the optimal clusternumber. In the genetic algorithm, we used the adaptive crossover probability and mutationprobability to speed up the convergence of the algorithm. Finally, we do two experiments to test the effectiveness of improved K-means algorithmand k-value learning algorithm. In the first experiment, we run both the traditional K-meansalgorithm and improved K-means algorithm on five groups of data, and compare the clusteringresults and the time each algorithm cost. In the second one, we run k-value learning algorithmon above five groups of data. The experiment shows that the improved K-means algorithm canget better clustering results with less time, and k-value learning algorithm can find the bestcluster number at a high probability.
Keywords/Search Tags:cluster analysis, K-means algorithm, genetic algorithm, improved K-means algorithm
PDF Full Text Request
Related items