CUDA-based Parallel K-Means Algorithm Of Text Clustering

Posted on:2013-12-29

Degree:Master

Type:Thesis

Country:China

Candidate:Q Wang

Full Text:PDF

GTID:2248330395959611

Subject:Software engineering

Abstract/Summary:

Data mining is the technology of analyzing vast amounts of data to discover potential, novel, and valuable information, and it has important applications in many areas. Cluster analysis groups a data collection into multiple classes of similar objects, so that the objects in each cluster have a great deal in common; it therefore plays a key role in data mining. The K-means algorithm is the most classic partition-based clustering method and one of the top ten classical data mining algorithms. Faced with the challenge of massive data, parallel computing is one of the most effective ways to solve large-scale computational problems. CUDA provides developers with a C-like language environment for GPU development, together with a rich API that lets them make better use of the GPU's parallel computing capabilities.

This paper first analyzes the problems posed by massive data and the prospects of text clustering within cluster analysis. It then proposes a CUDA-based parallel implementation of the K-means algorithm that takes the dimensionality of the data as an important reference. We used two different strategies for low-dimensional and high-dimensional data sets. For low-dimensional data sets, we chose registers to reduce data access latency and designed a parallel K-means algorithm oriented to low-dimensional data. For high-dimensional data sets, we designed a strategy that uses registers and shared memory together to process the data, in order to achieve high computational and memory access performance, and proposed a parallel K-means algorithm oriented to high-dimensional data.

We designed a heterogeneous parallel K-means text clustering system, including the system architecture design and the modular design of its functions. The system's functions are divided into four modules: the user input module, the parallel K-means algorithm module for low-dimensional data, the parallel K-means algorithm module for high-dimensional data, and the output module.

We then tested the system on a workstation with an Nvidia Tesla C1060 graphics card, using crawled news data as the test data source after format preprocessing, and clustering purity as the measure of the algorithm's accuracy. The experimental results showed that the proposed algorithm can perform text clustering quickly and accurately. The results of this study show that the CUDA-based K-means text clustering algorithm is an effective way to improve the performance of text clustering.
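
The abstract names clustering purity as its accuracy measure but does not define it. The following is a minimal Python sketch of the commonly used definition: for each cluster, count the documents that belong to its most frequent true class, sum those counts, and divide by the total number of documents. The function name `clustering_purity` and the toy labels are illustrative assumptions, not taken from the thesis, and the thesis may compute purity differently.

```python
from collections import Counter, defaultdict


def clustering_purity(cluster_ids, true_labels):
    """Return purity: (1/N) * sum over clusters of the majority-class count.

    Assumed standard definition; not taken from the thesis itself.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # For every cluster, count how often each true label occurs in it.
    members = defaultdict(Counter)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid][label] += 1

    # Sum the size of the dominant class in each cluster.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six news documents assigned to two clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sports", "sports", "finance", "finance", "finance", "sports"]
print(clustering_purity(clusters, labels))
```

In this toy case each cluster contains two documents of its dominant class out of three, so purity is four out of six. A value of 1.0 means every cluster contains documents from a single class.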

Keywords/Search Tags:

GPU, CUDA, K-means, Cluster analysis

Related items

1	The Research On Fuzzy C-Means Cluster Analysis And Its Applications
2	Class Equality Cluster Validity Index And Cluster Filter K-Means Algorithm
3	Research And Application Of Improved K-means Algorithm In Multivariate Analysis System
4	Research And Application Of K-means Clustering Algorithm
5	Differentially Private K-means Clustering
6	Based On The K-Means Cluster Research And Realization Of The Web Information Retrieval
7	Research Of Improved K-means Algorithm And New Cluster Validity Index In Cluster Analysis
8	The Application Of Data Mining In Comprehensive Assessment Of National Area
9	The Research And Application Of Cluster Analysis On The Public Security Specialist Examination Analyze System
10	Research On The Evaluation Methods Of Cluster Analysis Results