
Research on Parallelization of Data Mining Algorithms Based on GPU

Posted on: 2019-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: J F Zheng
GTID: 2428330596964813
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the rise of Internet technology and the spread of computers, people have generated large amounts of data in many fields. Data mining emerged to extract valuable information from this data, and clustering and classification have long been a focus of research within it. As data continues to accumulate, traditional CPU-based mining algorithms have run into performance bottlenecks, while the rapid development of graphics processing units (GPUs) has provided a good platform for general-purpose computing. GPU-based high-performance parallel computing has therefore become an active research topic.

The k-means clustering algorithm and the KNN classification algorithm are widely used in many fields, but each has two main problems when faced with massive data. For k-means, first, in each iteration many data points do not change their cluster assignment, so a large share of the computation is redundant. Second, for large, high-dimensional data sets, computing the distance from each data point to each cluster center and updating the cluster centers are both costly, and the required time rises steeply. For KNN, first, it is a lazy learning algorithm, so when the training set is large and high-dimensional, the amount of distance computation is large. Second, the distance sorting phase for each test sample has high time complexity and consumes a lot of time.

This thesis analyzes the parallelism of the k-means and KNN algorithms and, based on the problems above, introduces GPU-based parallel computing into both. The main contributions are as follows:

1. A GPU-based k-means algorithm called GS_k-means is proposed. The algorithm first removes redundant calculations by using a GPU-based global selector to filter out data points that do not cause any clustering change in the current iteration. It then uses general matrix multiplication from the cuBLAS library to speed up the calculation of the distance from each data point to all cluster centers. Finally, the cluster centers are updated by grouping data points that share the same label, which improves the parallelism of the algorithm.

2. A GPU-based KNN algorithm called GS_KNN is proposed. The algorithm first uses general matrix multiplication from the cuBLAS library to accelerate the calculation of distances between the test data and the training data. For the distance sorting stage, it then proposes two optimization strategies, selected according to the value of k: neighbor selection based on k rounds of minimum search, and neighbor selection based on bitonic sorting. Finally, the label counting step is also ported to the GPU and parallelized with atomic add operations, which improves the efficiency of the algorithm.

Both algorithms are compared with existing algorithms. The experimental results show that the proposed algorithms reduce execution time, demonstrating their feasibility and efficiency.
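The distance step that both algorithms hand to cuBLAS matrix multiplication can be understood through the expansion ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x·c, in which the cross terms x·c for all point/center pairs form a single matrix product. The following is a minimal sketch of that idea under stated assumptions, not the implementation from the thesis: plain Python stands in for the GPU, and the function names (`matmul_transposed`, `squared_distances`, `assign_labels`) are hypothetical. On a GPU, `matmul_transposed` is the part that a cuBLAS GEMM routine would replace.

```python
# Illustrative sketch only: plain Python stands in for GPU code.
# All function names here are hypothetical, not taken from the thesis.


def matmul_transposed(a, b):
    """Return the matrix product a * b^T for row-major lists of lists.

    This is the step a GEMM routine would perform on the GPU.
    """
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def squared_distances(points, centers):
    """Squared Euclidean distance from every point to every center.

    Uses ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * (x . c), so the only
    pairwise work is one matrix product.
    """
    point_norms = [sum(v * v for v in p) for p in points]
    center_norms = [sum(v * v for v in c) for c in centers]
    cross = matmul_transposed(points, centers)
    return [
        # Clamp at zero to absorb small negative values from rounding.
        [max(point_norms[i] + center_norms[j] - 2.0 * cross[i][j], 0.0)
         for j in range(len(centers))]
        for i in range(len(points))
    ]


def assign_labels(points, centers):
    """Index of the nearest center for each point (k-means assignment)."""
    dists = squared_distances(points, centers)
    return [min(range(len(row)), key=row.__getitem__) for row in dists]


if __name__ == "__main__":
    pts = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
    ctrs = [[0.0, 0.0], [10.0, 10.0]]
    print(squared_distances(pts, ctrs))
    print(assign_labels(pts, ctrs))
```

The same distance matrix serves KNN if `centers` is replaced by the training set: each row then holds the distances from one test sample to all training samples, ready for the neighbor selection stage.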
Keywords/Search Tags:data mining, GPU, parallel computing, k-means, KNN