
GPGPU-Oriented Research on Parallel Incremental Clustering Algorithms

Posted on: 2015-11-11    Degree: Doctor    Type: Dissertation
Country: China    Candidate: C L Chen    Full Text: PDF
GTID: 1228330452965518    Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of digital sensors, high-performance computing and communication, and large-capacity data storage, enormous volumes of data are being generated all the time in modern society. Machine learning is an effective method for analyzing raw data and extracting the information of interest. Machine learning can be categorized into supervised learning and unsupervised learning. In contrast to the former, the latter does not require labeled training data and can therefore find much wider application. Clustering analysis is a common approach to unsupervised learning.

Classical clustering algorithms have already been successfully applied in many fields, and they usually demand that the entire input be loaded into the host machine's memory. We call such algorithms batch clustering algorithms. However, batch clustering algorithms do not work efficiently in some application scenarios, including time-limited applications, memory-limited applications, redundant-data detection and elimination, and so on. In these scenarios, the entire input is divided into a number of subsets; the clustering algorithm must proceed step by step, obtaining the clustering result for the entire input by processing the subsets one by one. An algorithm that can accomplish such a task is called an incremental clustering algorithm.

Incremental clustering algorithms usually need to handle real-time or massive-data problems, and thus place high demands on the hardware platform's computing power. Parallel computing is a common way to satisfy such demands. The General-Purpose Graphics Processing Unit (GPGPU) is a novel parallel computing device with many advantages. The GPGPU's computing power can be fully utilized only if an enormous number of threads (significantly more threads than the GPGPU has processing cores) run simultaneously in SIMD (Single Instruction, Multiple Data) mode. An algorithm exhibits favorable GPGPU parallelism (i.e.
it is GPGPU-friendly) if it can be executed in such a manner.

In this dissertation, we focus on the intersection of parallel computing and machine learning. Our main goals are: first, to identify what is really challenging in designing a GPGPU-friendly incremental clustering algorithm; second, to design incremental algorithms that strike a balance between clustering accuracy and GPGPU parallelism; third, to optimize the algorithms based on CUDA (Compute Unified Device Architecture). Our main contributions are as follows:

1. We identify that existing incremental clustering algorithms face a dilemma between accuracy and GPGPU parallelism. By surveying typical existing algorithms, we categorize incremental clustering algorithms into two kinds: block-wise and point-wise. We find that block-wise algorithms sacrifice accuracy for parallelism, while point-wise algorithms sacrifice parallelism for accuracy.

2. Based on the Gaussian mixture model, we propose a top-down block-wise incremental clustering algorithm (the Top-Down Incremental Gaussian Mixture Model clustering algorithm, TDIGMM). TDIGMM evolves the clustering result in a top-down manner: before it finishes processing a newly arrived data block, the algorithm approximately pre-estimates the model order of the data "seen so far" (including the newly arrived block); with this pre-estimated model order as a constraint, homologous data points can be identified more flexibly. TDIGMM significantly reduces the probability of same-to-different mis-affiliations (i.e., homologous data points being affiliated to different clusters). TDIGMM has favorable GPGPU parallelism, because its most time-consuming part is GPGPU-friendly.

3. We present a nonparametric Evolving-Granularity-Centered incremental clustering algorithm (EGC). The evolving granularity of an incremental clustering algorithm is formally defined.
We point out that the evolving granularity is positively correlated with the probability of different-to-same mis-affiliations (i.e., heterologous data points being affiliated to the same cluster), and provide a formal proof of this conclusion. EGC combines the advantages of both the block-wise and point-wise algorithms and evolves the clustering result at a moderate granularity, namely the granularity of a micro-cluster. Micro-clusters can be generated in a GPGPU-friendly manner; the latest clustering result is obtained by processing the micro-clusters one by one. Since the number of micro-clusters is dramatically smaller than the number of data points, EGC reduces the amount of sequential work and remarkably shortens the period during which the GPGPU is underutilized.

4. TDIGMM and EGC are optimized based on CUDA. For a parallel algorithm, data transmission and data access can occupy a considerable proportion of the total execution time in certain scenarios. Incremental clustering algorithms must pay particular attention to this issue, because they continuously load incoming data. We leverage the CUDA pipeline pattern to hide data-transmission latency. The data-access operations of TDIGMM and EGC are optimized through data pre-fetching and data reordering, respectively.

Experiments validate the clustering accuracy and GPGPU parallelism of TDIGMM and EGC, as well as the performance of their optimization methods. Finally, we conclude the dissertation and point out future work.
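The block-wise incremental processing summarized above can be illustrated with a minimal sketch. This is not TDIGMM or EGC, only a generic streaming k-means-style loop under assumed names: a fixed set of running centroids is folded together with each arriving data block, so the full input never has to reside in memory at once.

```python
# Minimal sketch of block-wise incremental clustering (illustrative
# only; NOT the dissertation's algorithms). Running centroids are
# updated one data block at a time via streaming mean updates.
import numpy as np

def incremental_kmeans(blocks, k):
    centroids, counts = None, None
    for block in blocks:                      # block: (n, d) array
        if centroids is None:
            # simplification: seed with the first k points seen
            centroids = block[:k].astype(float)
            counts = np.ones(k)
            block = block[k:]
            if len(block) == 0:
                continue
        # assign each point in the block to its nearest centroid
        d2 = ((block[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # fold the block into the running means (a streaming update)
        for j in range(k):
            pts = block[assign == j]
            if len(pts):
                total = counts[j] + len(pts)
                centroids[j] = (centroids[j] * counts[j] + pts.sum(0)) / total
                counts[j] = total
    return centroids

# two 1-D blocks drawn from groups near 0 and near 10
blocks = [np.array([[0.0], [10.0], [0.2]]),
          np.array([[9.8], [0.1], [10.2]])]
centers = np.sort(incremental_kmeans(blocks, k=2).ravel())
```

Note how only one block is resident at a time; this is exactly the memory-limited scenario the abstract describes, and the per-point assignment step is the kind of data-parallel work a GPGPU handles well.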
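The exact TDIGMM update rules are not reproduced in this abstract, but the role of a pre-estimated model order as a constraint can be illustrated with a standard mixture-reduction step (a generic technique, not the author's method): Gaussian components are greedily merged by moment matching until the component count drops to the pre-estimated order.

```python
# Hedged illustration of constraining a mixture to a pre-estimated
# model order (generic mixture reduction, not TDIGMM itself): 1-D
# Gaussian components (weight, mean) are greedily merged by moment
# matching until only `target_k` remain. Covariances are omitted to
# keep the sketch short.

def reduce_to_order(weights, means, target_k):
    weights, means = list(weights), list(means)
    while len(means) > target_k:
        # find the pair of components with the closest means
        best = None
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                d = abs(means[i] - means[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # moment-matched merge of the chosen pair
        w = weights[i] + weights[j]
        m = (weights[i] * means[i] + weights[j] * means[j]) / w
        for idx in (j, i):                 # pop the higher index first
            weights.pop(idx)
            means.pop(idx)
        weights.append(w)
        means.append(m)
    return weights, means

w, m = reduce_to_order([0.3, 0.2, 0.5], [0.0, 0.4, 10.0], target_k=2)
```

In this toy run the two nearby components (means 0.0 and 0.4) are merged, while the distant one (mean 10.0) survives; the model-order cap is what forces homologous components to be identified with each other.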
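The micro-cluster granularity that EGC operates at can likewise be sketched in the spirit of classic feature-vector summaries (the exact EGC construction is not given here; the grid-cell keying below is an assumption for illustration). Each micro-cluster stores sufficient statistics (count, linear sum, squared sum); building them is a per-point operation that parallelizes well, while the sequential stage only walks the far smaller set of micro-clusters.

```python
# Sketch of the micro-cluster idea (illustrative; not the EGC
# algorithm). A data block is compressed into micro-clusters keyed
# by a coarse grid cell, each holding (count, linear sum, squared
# sum) sufficient statistics.
import numpy as np

def summarize_block(block, grid=1.0):
    micro = {}
    for x in block:
        # coarse grid cell used here as an illustrative micro-cluster key
        key = tuple(np.floor(x / grid).astype(int))
        n, ls, ss = micro.get(key, (0, np.zeros_like(x, dtype=float), 0.0))
        micro[key] = (n + 1, ls + x, ss + float(x @ x))
    return micro

def micro_centroids(micro):
    # the "one by one" sequential stage now iterates over
    # micro-clusters rather than over raw data points
    return {k: ls / n for k, (n, ls, _) in micro.items()}

block = np.array([[0.1, 0.1], [0.2, 0.0], [5.1, 5.0], [5.0, 5.2]])
cents = micro_centroids(summarize_block(block))
```

Four points collapse to two micro-clusters here; on realistic block sizes the compression is far larger, which is precisely how EGC shortens the sequential, GPGPU-underutilized phase.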
Keywords/Search Tags: incremental clustering, GPGPU, clustering accuracy, parallelism