Font Size: a A A

Research And Implementation Of Parallel Clustering Ensemble Based On Linux Cluster

Posted on:2009-01-20Degree:MasterType:Thesis
Country:ChinaCandidate:Q Y DuFull Text:PDF
GTID:2178360242981569Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
As the society coming into the information period, and thecomprehensive application of the computer network and computertechnology, the database in every industry accumulates substantive dataincreasingly. How to use these data and pick up useful information andknowledge from them to guide the production and distribution of theenterprises comes into being and develops a new computertechnology—Data Mining Technology which is widely used and hastremendous practicality.Data Mining (also called Knowledge Discovery in Databases) is atechnique that aims to extract implicit, previously unknown, and potentiallyuseful information out of large amounts of data sets. It has been viewed as animportant evolution in information technology. During the past over decade,new concepts and techniques on data mining have been presented, andespecially in the latest few years, some of them have been studied in higherlevels.Clustering analysis is an important research area of data mining and isan important method of data partition or grouping. Clustering has been usedin various ways including commerce, market analysis, biology, Webclassification and so on. Clustering algorithm is a non-supervised machinelearning algorithm. Different to the supervised learning algorithm, there is nounderstanding for the distribution of data sets prior to clustering, but its aimis to reveals the distribution of the true situation. In fact, any clusteringalgorithm has certain pre-assumptions for data sets itself, but if thedistribution of data sets doesn't accord with this hypothesis, the algorithmresults will be meaningless. So, faced with a specific application, how tochoose the appropriate clustering algorithm is an important topic of the studyof cluster analysis. In a supervised learning algorithm for the classification and regressionmodel, ensemble method is widely and successfully used. The method canovercome instability of classification, regression, and gives better results.However, the clustering algorithm is a non-supervised learning algorithm,and the distribution of the data sets without any prior knowledge, so theensemble method in classification and regression can not be directly appliedto clustering algorithms. This also led to the clustering ensemble methodstart late, which began a few years. The definition of Clustering ensemble isto merge different outcomes of partitioning on a group of objects, instead ofusing the characteristics of the original object. In recent years, Clusteringensemble method becomes hot research direction in the clustering area. Thisapproach to clustering algorithm has good stability and robustness.The characteristics of available data are too large and multi-dimensional,which makes the large-scale data mining application needs growing.Therefore, the high-performance parallel computing becomes the veryeffective tool in large measure to solve such problems. In addition, datamining also depend on the effectiveness of the available data sources. In thisway, developing effective parallel algorithms for various data miningtechnology appears to be particularly important.On the basis of study of the clustering ensemble methods, first of all,this thesis proposed clustering ensemble method based on the parallelK-medoids for the high complexity problem, which called CEPK-medoidsalgorithm. Although the K-means algorithm is certain scalability andparallelism, as well as efficiency, but it takes the mean value of cluster ascenter point, which makes algorithm very sensitive for noise and isolatedpoint. However, the K-medoids method taking center point object as thecluster center to identifies a cluster. This method is a relatively robust, beingable to better handle isolated data points. But this algorithm has very highcomplexity and of computation amount than the K-means, generally onlysuitable for small data, but may not have the obvious parallelism.Therefore, combine these two kinds of thinking, and take the data object nearest to the mean of cluster as center point. Such parallel K-medoidsalgorithm can get better efficiency, and also reduce the effects of isolatedpoints. So, propose CEPK-medoids which takes the parallel K-medoids asbasis clustering algorithm to generate clustering members.Although the parallel thinking was introduced into the CEPK-medoidsalgorithm, but only into the production course of each clustering member.And the overall production process are also belong to serial thinking, paralleldegree not enough as well. Therefore, this thesis analyzed further anddesigned another parallel clustering ensemble method with a higher degreeof parallel, named PCE (Parallel Clustering Ensemble).Given there are k clusters in the data set X, then the data in theseclusters will be assigned to each processor nodes. Each processor will obtainclustering results including several clusters after clustering according to thedata sets on it. And what we can imagine is that the cluster centers arerelatively more compact after the data belonging to the same class originallywas clustered respectively in various processors. Therefore, we name theclusters on each processor Cluster Points, and represent with cluster centers.Clustering these cluster points again, then the cluster points belonging to thesame class should be polymerization to the same cluster in theory, thus thefinal clustering result was obtained.Of course, there may be a certain degree of deflection in this process,but also the border between clusters may not be clear, which resulting in acertain inaccuracy. However, the clustering itself is the process without exactresult. But clustering ensemble is process of fusion multiple clusteringresults, and this process will eliminate the inaccuracy of some results.But such clustering process only can obtain one cluster membereventually. If each processor broadcast cluster points to other processorsrespectively, that is, each processor has all the cluster points, then p clustermembers will be obtained in parallel after clustering with these cluster pointsrespectively and p is the number of processors. If the serial algorithmrunning r times to obtain r cluster members, however only need to run ceiling of r/p times using this method. Thus, the degree of parallelism ofalgorithm has been enhanced and the algorithm efficiency will also beimproved.K-means and K-medoids algorithms only suitable for those found littledifference between the size of the spherical or convex clustering. But for adifference in size or arbitrary shape clustering data sets, they may divide anatural cluster into small clusters and can hardly be ideal clustering effect.Therefore, the PCE algorithm takes one-pass clustering algorithm as thebasis algorithm for producing clustering members. One-pass algorithm is aclustering algorithm based on minimum distance principle and combiningclustering threshold thinking, which only scan data sets one pass withapproximate linear time complexity.According to the limited laboratory conditions, has set up ahigh-performance computing clusters environment under Linux system.Through a series of experiments such as speedup, scalability and sizeup etc.on this platform showed that these two parallel ensemble strategies have amore stable clustering results, but also for the large-scale data sets has longerresponse time on single, both of them have better efficiency compare toserial algorithm.Looking at the experiment results of these parallel performanceindicators, the performance of PCE algorithm is not as good as that ofCEPK-medoids algorithm. This is because the limited experiment in thecurrent environment, with each node was connected to an internal switch;therefore, time synchronization between nodes is trivial. However, if thereare more nodes connected to different switches, then CEPK- medoidsalgorithm will be affected by synchronization time very much. On thecontrary, for the PCE algorithm, the greater p, the fewer the number ofsynchronization, and efficiency will be higher. However, for more frequentcommunication between process, the current PCE algorithm is not efficientthan CEPK-medoids algorithm. Improvement in this area will be done in thefuture, such as to increase the computing time compared to the proportion of communication time, reducing the number of communications, will be evengreater increase the efficiency of the algorithm.This paper is an expansion exploratory for clustering ensemble methodin parallel area. Although these two parallel strategies in relation to serialalgorithm enhances performance, but there is still some room forimprovement. Therefore, I will continue to study parallel clusteringensemble method, maximize parallelism as far as possible, and make full useof effective resources as well.
Keywords/Search Tags:Implementation
PDF Full Text Request
Related items