| Data mining is a technology that analyzes vast amounts of data to uncover latent, novel, and valuable information, and it has important applications in many fields. Cluster analysis groups a collection of data objects into multiple classes such that the objects within each cluster are highly similar to one another, so it plays a key role in data mining. The K-means algorithm is the most classic partition-based clustering method and is one of the top ten classical data mining algorithms. Faced with the challenge of massive data, parallel computing is one of the most effective ways to handle large-scale computational problems. CUDA provides developers with a C-like language environment for GPU programming, along with a rich API that enables them to make better use of the GPU's parallel computing capabilities. This paper first analyzes the problems that massive data poses for text clustering and the prospects of text clustering within cluster analysis. It then proposes a CUDA-based parallel K-means algorithm that treats the dimensionality of the data as an important factor in how the algorithm is parallelized. We use two different strategies for low-dimensional and high-dimensional data sets. For low-dimensional data sets, we use registers to reduce data access latency and design a parallel K-means algorithm oriented to low-dimensional data. For high-dimensional data sets, we design a strategy that uses registers and shared memory together to process the data, in order to achieve high computational and memory-access performance, and propose a parallel K-means algorithm oriented to high-dimensional data. We also design a heterogeneous parallel K-means text clustering system, covering both the system architecture and the modular design of its functions. The system is divided into four functional modules: the user input module, the low-dimensional parallel K-means module, the high-dimensional parallel K-means module, and the output module. 
We then tested the system on a workstation with an Nvidia Tesla C1060 graphics card, using crawled news data, preprocessed into a uniform format, as the test data source and clustering purity as the measure of the algorithm's accuracy. The experimental results show that the proposed algorithm can cluster text quickly and accurately. These results indicate that a CUDA-based K-means text clustering algorithm is an effective way to improve the performance of text clustering. |
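
The abstract names clustering purity as its accuracy measure but does not define it. The sketch below assumes the standard definition: each cluster is credited with the count of its most frequent true class, and the credited counts are summed and divided by the total number of documents. The function name `clustering_purity` and the sample labels are hypothetical, chosen only for illustration; they are not taken from the paper's system or data.

```python
from collections import Counter, defaultdict


def clustering_purity(cluster_ids, true_labels):
    """Return purity in [0, 1] under the standard definition (an assumption here).

    cluster_ids: cluster assignment of each document
    true_labels: ground-truth class of each document, in the same order
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if len(cluster_ids) == 0:
        raise ValueError("purity is undefined for an empty data set")

    # Count how often each true class occurs inside each cluster.
    class_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        class_counts[cluster][label] += 1

    # Credit each cluster with the size of its majority class.
    majority_total = sum(max(counts.values()) for counts in class_counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical example: 8 news documents assigned to 3 clusters.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["sports", "sports", "finance", "finance", "finance",
          "tech", "tech", "sports"]
print(clustering_purity(clusters, labels))  # 0.75
```

In this example each cluster has two documents of its majority class, so the purity is 6/8. A purity of 1.0 means every cluster contains documents of a single class; note that purity tends to rise as the number of clusters grows, so it is most meaningful when comparing runs with the same K.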