Font Size: a A A

K-means Algorithm And Parallelization

Posted on:2004-10-26Degree:MasterType:Thesis
Country:ChinaCandidate:J L MaoFull Text:PDF
GTID:2168360122475530Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
Data Mining, also called as knowledge discovery of databases (KDD), is a processing procedure of extracting credible, novel, effective and understandable patterns from databases. As a rising crossover subject, data mining involves an integration of techniques from multiple disciplines such as machine learning, pattern recognition, database technology, statistics and artificial intelligence. Cluster analysis, an important technology in data mining, which groups the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. K-means algorithm is a popular partition method in cluster analysis, the most widely used clustering criterion in it is the sum of the squared Euclidean distances between each data point xk and the centroid mj of the subset which contains xk. It is relatively scalable and high efficient for disposing big data set, and have the latent data-parallelism quality. However, this algorithm heavily depends on the sequence of data-inputting and easily gets into local optimum for random selecting initial cluster centers. In addition to that, when applying this square-error criterion to evaluate the clustering results, the objects in one cluster will be divided into two or more clusters for attempting to minimize the value of it. Aiming at the dependency to initial conditions and the limitation of K-means algorithm that applies the square-error criterion to measure the quality of clustering, this paper presents a new improved K-means algorithm that is based on effective techniques of multi-sampling and once-clustering to search the optimal initial values of centers. To make the algorithms more efficient, we have parallelized it on the PC-cluster system, which consists of PC(connected by LAN), PVM and LINUX. Master/Slave is adopted as parallel program model. By distributing data sets to slave nodes to realize the data-parallel, in the end, the master node gathers clustering results from each slave node and comes up with the final result. In practical experiment, we adopt an optimization tactic of data-distribution in consideration of K-means algorithm's intrinsic characteristic and disposal capability of each PC node. The article evaluates the parallel algorithm, with the criterion of time complexity and speedup gets from theoretical study and experiment result. Our experiments demonstrate that the new improved algorithm can obtain better stability than the original K-means in clustering results, and parallel algorithm can upgrade the efficiency of serial algorithm.
Keywords/Search Tags:Data mining, Clustering, K-means, PVM, Parallel Algorithm
PDF Full Text Request
Related items