
Accelerating Clustering Algorithms on the CUDA Graphics Processor

Posted on: 2014-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: R J Ma
GTID: 2348330503452526
Subject: Software engineering

Abstract/Summary:
Data sets have grown tremendously in recent years, as faster processing and communication have greatly improved the capacity for data generation and collection in many fields. Traditional serial data mining methods can hardly cope with data sets of this size, and parallelizing data mining is a more efficient approach. Currently, the multi-core CPU (Central Processing Unit) is widely adopted for parallel data mining. However, as data sets grow and the algorithms used in data mining become more complicated, the execution time of this computation-intensive work becomes too long and degrades the performance of the whole system. To address this difficulty, a more dedicated processing unit is needed to take over the computation and relieve the load on the system.

The graphics processing unit (GPU), because of its special architecture, is well suited to parallel computing on large-scale data sets. To meet the strict requirements of graphics processing, GPU designers devote more silicon area to arithmetic computation and aim for higher memory bandwidth. In particular, NVIDIA's release of the CUDA (Compute Unified Device Architecture) GPU platform and its related development environment in 2007 greatly reduced the difficulty of GPU programming and let developers master it quickly. At present, more and more developers from various fields, such as scientific computing, financial engineering, and data mining, are using GPUs to improve the performance of their computing systems. Besides using the GPU in a single node, more and more developers are also trying to adopt it in distributed computing environments.

In this paper, we implement K-means clustering and single-linkage agglomerative hierarchical clustering, two of the most popular clustering algorithms in data mining, on a GPU from the NVIDIA GTX 260 family, in order to demonstrate the feasibility of using the GPU for clustering algorithms. After that, for the development of our company's CRM, we implement GPU computing within the Hadoop framework.

Clustering is an important step that is frequently used in data mining. It aims to group scattered objects into several clusters based on an agreed similarity measure. The K-means algorithm implemented in our experiment assigns each of the N data points to the cluster at the shortest Euclidean distance; after several iterations, the N points are grouped into K clusters. As a classic clustering algorithm, K-means is widely used in data mining, bioinformatics, image recognition, AI, and other fields. As a concrete example, K-means clustering is used in Apache Mahout.

The other algorithm implemented in this paper, hierarchical clustering, first initializes the N objects as N sub-clusters. The similarity between sub-clusters is measured by Euclidean distance. The two sub-clusters separated by the shortest distance are then merged into one cluster. After several iterations, when all of the objects belong to the same cluster, the algorithm terminates.

Although the operations and goals of the two clustering algorithms differ, they have one thing in common: both involve computation over a large number of data objects. Because of this, we can use the multi-core CPU and the GPU to parallelize both algorithms, which greatly reduces the running time and increases the throughput of data objects processed.

Because the company's CRM will run in a distributed computing environment, we select the Hadoop software framework as the target platform. It allows users to perform distributed computing on large-scale data using a cluster of commodity computers. So far, Hadoop has many users, such as Google and Facebook.

In our experiment, we use three data sets of different scales as input for K-means clustering. Because of limited hardware resources, we use a smaller data set for hierarchical clustering. Following the CUDA programming model, we separate the code into two parts: one runs on the CPU, and the other, which contains the computation-intensive work, is executed by the GPU. This CPU+GPU model is also called "heterogeneous computing". For comparison, we also run the same input data on a pure multi-core CPU platform to obtain the speed-up ratio. Finally, we implement GPU clustering in a distributed environment with Hadoop; all of the methods and related results will be used in the development of the company's CRM.
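
The two algorithms described above can be sketched as follows. This is a minimal serial Python illustration, not the CUDA code from the thesis, which is not shown here; the function names (`nearest_center`, `kmeans`, `single_linkage`), the convergence test, and the handling of empty clusters are assumptions made for the example. Comments mark the distance computations that are independent of one another and could therefore be spread across GPU threads or CPU cores.

```python
import math


def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def kmeans(points, centers, max_iters=100):
    """Plain K-means: alternate assignment and center update until stable."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iters):
        # Assignment step: every point is handled independently of the others,
        # so this loop is the natural candidate for one-thread-per-point work.
        labels = [nearest_center(p, centers) for p in points]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = tuple(sum(xs) / len(members) for xs in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    # Recompute labels so they always match the returned centers.
    labels = [nearest_center(p, centers) for p in points]
    return labels, centers


def single_linkage(points):
    """Agglomerative single-linkage clustering.

    Returns the merge history as (cluster_id_a, cluster_id_b, distance).
    Original points have ids 0..N-1; each merge creates the next unused id.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        # Pairwise search: every (a, b) distance is independent of the others,
        # which is the part that lends itself to parallel evaluation.
        for ai in range(len(ids)):
            for bi in range(ai + 1, len(ids)):
                a, b = ids[ai], ids[bi]
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges
```

For instance, `kmeans(points, initial_centers)` returns one cluster label per point together with the final centers, and `single_linkage(points)` returns the sequence of merges until a single cluster remains. The single-linkage sketch recomputes all pairwise distances at every merge for clarity; a practical version would cache a distance matrix.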
Keywords/Search Tags: Multi-core CPU, Parallel Computing, Data Mining, CUDA, GPU, K-means, Single-linkage, Hierarchical clustering, Hadoop, MapReduce