In recent years, the spread of information technology and the rapid development of hardware have created the preconditions for generating and storing data at very large scale. Businesses, scientific research institutions, government departments, and other organizations now store large amounts of data. How to extract useful information from these large data sets has become a growing concern, and data mining has attracted attention and developed rapidly in this context. Clustering, an important data mining tool, is the process of grouping similar objects together and placing dissimilar objects in different groups; it has been widely used in various fields.

This paper first introduces the basic theory of data mining and cluster analysis, with an emphasis on Dirichlet mixture model clustering. We then study the Dirichlet process mixture model algorithm and its concrete implementation in the Apache Mahout machine learning library. The model is a Bayesian mixture model with a Dirichlet process prior. Mahout provides both an in-memory implementation and a MapReduce implementation; this paper mainly studies the latter. We run the Dirichlet process clustering algorithm on multiple data sets and, by analyzing the results, conclude that the overhead of the algorithm is concentrated in the map function. This paper also studies the GPU (graphics processing unit) and proposes a GPU-parallel scheme to improve the efficiency of the algorithm. We examine the GPU architecture and its advantages, as well as CUDA parallel programming, and then implement the scheme on top of Mahout's Dirichlet process implementation by calling a CUDA program through JNI, with the CUDA program executing the map function in parallel. Finally, we analyze the results obtained with the same input data.
Comparing the performance of the original and improved implementations, we conclude that the improved program is more efficient, and that the improvement becomes more pronounced as the data size grows. These results can provide a useful reference for performance research on data mining algorithms.
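
To illustrate the kind of per-point work that a map step in Dirichlet process clustering performs, the following is a minimal Java sketch, not Mahout's actual code. It assumes spherical Gaussian component models with a shared standard deviation; the class name `MapStepSketch` and the methods `logPdf`, `posterior`, and `sample` are hypothetical names introduced only for this illustration. For one data point, it computes the normalized posterior probability of each model (mixture weight times density) and then samples a model assignment from that distribution.

```java
import java.util.Random;

// Hypothetical illustration only; this is not Mahout's API.
final class MapStepSketch {

    // Log density of a spherical Gaussian, omitting the constant
    // (2*pi)^(d/2) factor, which cancels when the posterior is normalized.
    static double logPdf(double[] x, double[] mean, double sd) {
        double sq = 0.0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - mean[d];
            sq += diff * diff;
        }
        return -sq / (2.0 * sd * sd) - x.length * Math.log(sd);
    }

    // Posterior probability of each model for one point:
    // p[j] is proportional to weights[j] * pdf_j(x).
    static double[] posterior(double[] x, double[][] means, double sd, double[] weights) {
        int k = means.length;
        double[] logP = new double[k];
        double max = Double.NEGATIVE_INFINITY;
        for (int j = 0; j < k; j++) {
            logP[j] = Math.log(weights[j]) + logPdf(x, means[j], sd);
            max = Math.max(max, logP[j]);
        }
        // Subtract the maximum before exponentiating to avoid underflow.
        double[] p = new double[k];
        double sum = 0.0;
        for (int j = 0; j < k; j++) {
            p[j] = Math.exp(logP[j] - max);
            sum += p[j];
        }
        for (int j = 0; j < k; j++) {
            p[j] /= sum;
        }
        return p;
    }

    // Draw a model index from the discrete distribution p.
    static int sample(double[] p, Random rng) {
        double u = rng.nextDouble();
        double acc = 0.0;
        for (int j = 0; j < p.length; j++) {
            acc += p[j];
            if (u < acc) {
                return j;
            }
        }
        return p.length - 1; // guard against floating-point rounding
    }
}
```

Under these assumptions, the posterior computation for each point depends only on that point and on the current model parameters, so the points can be processed independently. This independence is what makes a one-thread-per-point GPU formulation of such a step plausible.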