K-means Algorithm And Parallelization

Posted on:2004-10-26

Degree:Master

Type:Thesis

Country:China

Candidate:J L Mao

Full Text:PDF

GTID:2168360122475530

Subject:Computer system architecture

Abstract/Summary:

PDF Full Text Request

Data Mining, also called as knowledge discovery of databases (KDD), is a processing procedure of extracting credible, novel, effective and understandable patterns from databases. As a rising crossover subject, data mining involves an integration of techniques from multiple disciplines such as machine learning, pattern recognition, database technology, statistics and artificial intelligence. Cluster analysis, an important technology in data mining, which groups the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. K-means algorithm is a popular partition method in cluster analysis, the most widely used clustering criterion in it is the sum of the squared Euclidean distances between each data point xk and the centroid mj of the subset which contains xk. It is relatively scalable and high efficient for disposing big data set, and have the latent data-parallelism quality. However, this algorithm heavily depends on the sequence of data-inputting and easily gets into local optimum for random selecting initial cluster centers. In addition to that, when applying this square-error criterion to evaluate the clustering results, the objects in one cluster will be divided into two or more clusters for attempting to minimize the value of it. Aiming at the dependency to initial conditions and the limitation of K-means algorithm that applies the square-error criterion to measure the quality of clustering, this paper presents a new improved K-means algorithm that is based on effective techniques of multi-sampling and once-clustering to search the optimal initial values of centers. To make the algorithms more efficient, we have parallelized it on the PC-cluster system, which consists of PC(connected by LAN), PVM and LINUX. Master/Slave is adopted as parallel program model. By distributing data sets to slave nodes to realize the data-parallel, in the end, the master node gathers clustering results from each slave node and comes up with the final result. In practical experiment, we adopt an optimization tactic of data-distribution in consideration of K-means algorithm's intrinsic characteristic and disposal capability of each PC node. The article evaluates the parallel algorithm, with the criterion of time complexity and speedup gets from theoretical study and experiment result. Our experiments demonstrate that the new improved algorithm can obtain better stability than the original K-means in clustering results, and parallel algorithm can upgrade the efficiency of serial algorithm.

Keywords/Search Tags:

Data mining, Clustering, K-means, PVM, Parallel Algorithm

PDF Full Text Request

Related items

1	A Fast And Efficient Parallel Bisecting K-Means Algorithm
2	Scmi-superviscd K-means Clustering Algorithm In Data Mining
3	K-means Algorithm And Parallelization
4	Visual data mining: Using parallel coordinate plots with K-means clustering and color to find correlations in a multidimensional dataset
5	The Encrypted Block Cipher Algorithm Based On Gpu Parallel Clustering Research And Implementation
6	The Research Of Clustering Data Mining Based On Swarm Intelligence Algorithm
7	Research On Parallel Optimization Of Clustering Algorithms In Data Mining
8	Clustering Data Mining Applications In Department Store And K-means Clustering Algorithm Improvement
9	Research And Improvement Of K - Means Clustering Algorithm
10	Fuzzy C-means And K-means Clustering Algorithm And Its Parallel