
Research On High-dimensional Text Data Clustering Algorithms And Parallel Design

Posted on: 2020-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: X L Shan
GTID: 2428330590996822
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet and the extensive use of social media, the volume of text data has increased dramatically. Text document clustering is an effective way to filter out the information that users are interested in. However, text data is typically high-dimensional and sparse, which makes the unsupervised learning task of clustering more difficult. Although many improved text clustering algorithms exist, they still cannot meet the needs of practical applications in terms of accuracy and real-time performance. This thesis therefore makes further efforts in this direction and proposes a parallel k-means clustering algorithm for high-dimensional sparse text data, called the pkmeans algorithm. The pkmeans algorithm addresses the problem of clustering high-dimensional, sparse text data accurately, so as to provide good preprocessing results for data mining, data analysis and other tasks.

The pkmeans algorithm consists of three parts: a dimensionality reduction module, a clustering algorithm module and a parallel design module. Its major contributions are as follows. Firstly, in the dimensionality reduction module, a dimensionality reduction model, SAE, based on an autoencoder network is proposed to perform feature selection. Its goal is to cope with the high-dimensional and sparse nature of text data and to improve the accuracy of keyword extraction from text data as much as possible. Secondly, in the clustering algorithm module, an initial center selection method based on density and k-means++ is proposed. Combining the two methods takes the influence of boundary points into account while also preserving precision. Together, these two modules improve the accuracy of clustering high-dimensional text data. Thirdly, in the parallel design module, the CUDA architecture and the MPI message-passing interface are used to parallelize the algorithm and thereby improve its running speed.

Finally, this thesis verifies the feasibility of the proposed pkmeans algorithm and carries out a large number of comparative experiments on real data sets to test its accuracy and running speed. The experimental results show that the SAE model can largely avoid the high-dimensional sparsity of text data and extract meaningful features. In addition, the proposed clustering algorithm performs well on most data sets and has certain advantages over other improved variants of k-means. The parallel implementation runs much faster than the CPU-based algorithm, and the use of MPI also improves the portability of the algorithm. In this way, the pkmeans algorithm proposed in this thesis improves both the accuracy and the running speed of the algorithm.
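
The abstract does not give the formula that combines density with k-means++ for initial center selection, so the following is only a minimal sketch of one plausible reading, not the thesis's method. It assumes that density is a neighbour count within a fixed radius, that the first center is the densest point, and that each later center is sampled with probability proportional to the k-means++ squared distance multiplied by (1 + density). The names `local_density`, `density_kmeanspp_init` and the parameter `radius` are hypothetical.

```python
import math
import random


def local_density(points, radius):
    """For each point, count how many other points lie within `radius`."""
    counts = []
    for i, p in enumerate(points):
        n = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        counts.append(n)
    return counts


def density_kmeanspp_init(points, k, radius, seed=0):
    """Pick k initial centers (assumed scheme, see the lead-in).

    The first center is the densest point. Each later center is drawn
    with probability proportional to D(x)^2 * (1 + density(x)), where
    D(x) is the distance from x to its nearest already-chosen center.
    """
    rng = random.Random(seed)
    density = local_density(points, radius)
    indices = range(len(points))

    # Densest point first; ties resolve to the lowest index.
    first = max(indices, key=lambda i: density[i])
    chosen = [first]

    # Squared distance from every point to its nearest chosen center.
    d2 = [math.dist(p, points[first]) ** 2 for p in points]

    while len(chosen) < k:
        weights = [d2[i] * (1 + density[i]) for i in indices]
        if sum(weights) == 0:
            raise ValueError("fewer than k distinct points available")
        nxt = rng.choices(indices, weights=weights, k=1)[0]
        chosen.append(nxt)
        # Update nearest-center distances with the new center.
        d2 = [
            min(d2[i], math.dist(points[i], points[nxt]) ** 2)
            for i in indices
        ]

    return [points[i] for i in chosen]
```

Because an already-chosen center has zero distance to itself, its weight is zero and it cannot be drawn twice. Sparse boundary points get a smaller density factor than points in crowded regions, which is one way to reduce their influence on the seeding while keeping the spread that k-means++ provides.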
Keywords/Search Tags: Text Clustering, Autoencoder, K-means, CUDA framework, MPI Messaging Interface