Parallel Computing Design And Implementation For Dirichlet Algorithm Based On GPU

Posted on:2014-01-30

Degree:Master

Type:Thesis

Country:China

Candidate:M S He

GTID:2248330398470755

Subject:Computer Science and Technology

Abstract/Summary:

In recent years, the spread of information technology and rapid advances in hardware have made it possible to generate and store data at very large scale. Businesses, research institutions, government departments, and other organizations now hold large amounts of data. How to extract useful information from these large data sets has become a growing concern, and data mining has attracted attention and developed rapidly in this context. Clustering, the process of placing similar objects in the same group and dissimilar objects in different groups, is an important data mining tool and has been widely used in many fields.

This thesis first introduces the basic theory of data mining and cluster analysis, with an emphasis on Dirichlet mixture model clustering. It then studies the Dirichlet process mixture model algorithm and its concrete implementation in the Apache Mahout machine learning library. The model is a Bayesian mixture model with a Dirichlet process prior. Mahout provides both an in-memory implementation and a MapReduce implementation; this thesis mainly studies the latter. Running the Dirichlet process clustering algorithm on several groups of data sets and analyzing the results shows that the algorithm's overhead is concentrated in the map function.

The thesis also studies the GPU (graphics processing unit) and proposes a GPU-based parallel scheme to improve the efficiency of the algorithm. It examines the GPU architecture and its advantages, as well as CUDA parallel programming, and then implements the scheme on top of Mahout's Dirichlet process implementation by calling a CUDA program through JNI, with the CUDA program handling the map function in parallel. Finally, the original and improved programs are run on the same input data and their performance is compared. The comparison shows that the improved program makes the algorithm more efficient, and that the improvement becomes more pronounced as the data size grows. These results can serve as a useful reference for performance research on data mining algorithms.
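
To make concrete why the map function is a natural candidate for GPU parallelization, the following is a minimal sketch of a per-point assignment step of the kind used in Dirichlet process mixture clustering. It is written in Python for brevity and is not the thesis's CUDA kernel or Mahout's Java code. The isotropic Gaussian model family, the function names (gaussian_pdf, assignment_probabilities, map_step), and the sample data are all assumptions made for illustration. Each point's cluster is sampled with probability proportional to the mixture weight times the model density at that point, and no point depends on any other, so the loop body could in principle be executed by one GPU thread per point.

```python
import math
import random


def gaussian_pdf(point, mean, std):
    """Density of an isotropic Gaussian at `point` (assumed model family)."""
    dim = len(point)
    sq_dist = sum((x - m) ** 2 for x, m in zip(point, mean))
    norm = (2.0 * math.pi * std * std) ** (dim / 2.0)
    return math.exp(-sq_dist / (2.0 * std * std)) / norm


def assignment_probabilities(point, models, mixture):
    """Normalized p(k | point), proportional to mixture[k] * pdf_k(point)."""
    weights = [w * gaussian_pdf(point, mean, std)
               for w, (mean, std) in zip(mixture, models)]
    total = sum(weights)
    if total == 0.0:
        # Every density underflowed to zero: fall back to the mixture weights.
        mix_total = sum(mixture)
        return [w / mix_total for w in mixture]
    return [w / total for w in weights]


def map_step(points, models, mixture, rng):
    """Sample one cluster index per point; iterations are independent."""
    assignments = []
    for point in points:
        probs = assignment_probabilities(point, models, mixture)
        assignments.append(rng.choices(range(len(models)), weights=probs)[0])
    return assignments


# Hypothetical example: two well-separated clusters, equal mixture weights.
models = [((0.0, 0.0), 1.0), ((10.0, 10.0), 1.0)]  # (mean, std) per model
mixture = [0.5, 0.5]
points = [(0.1, -0.2), (9.8, 10.3)]
print(map_step(points, models, mixture, random.Random(0)))
```

In a GPU version along the lines the abstract describes, the work inside the loop of map_step is what would move into a kernel, while the model parameters and mixture weights would be copied to the device once per iteration. How the thesis actually partitions this work is not stated in the abstract.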

Keywords/Search Tags:

data mining, model clustering, Dirichlet Process, GPU, Mahout

Related items

1	Review Clustering Using Dirichlet Process Multinomial Mixture Models
2	Research On Deep Web Sources Clustering Based On Dirichlet Process
3	A Clustering Method Based On Sticky Hierarchical Dirichlet Process And Its Application
4	Research Of Clustering Algorithm Based On Mahout
5	Topic Model For Graph Mining Based On Hierarchical Dirichlet Process
6	Research And Implementation Of Distributed Topic Clustering Technology For Text Flow
7	Design And Implementation Of The Data Mining Platform Based On Mahout
8	Research And Application Of Books Intelligent Retrieval System Based On Mahout Model
9	Topic Model Based On Dirichlet Process
10	Model-based Algorithms For Text Clustering