The Clustering Study Of Massive Short Text

Posted on: 2016-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2308330473465482
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet and computer technology, the volume of information is growing explosively, driven by new platforms such as Twitter and Sina Weibo. More than 1,000,000 pieces of information are published on these platforms every minute. We are now living in the era of big data, and this vast amount of data affects our lives and offers an effective way to obtain information. The data are massive, diverse, dynamic, and made up of very few words per item, so they are referred to as short text data. The key problem is how to mine hidden user information from massive short text data quickly, accurately, and easily, in order to understand customer needs and create business opportunities. Text mining technology emerged to meet this need: it extracts valuable information from massive short text data, which is precisely the goal of this work. The core of text mining is the algorithm, because a good algorithm improves both the speed and the quality of mining. Given the scale and diversity of short text data, a clustering algorithm is a good fit, because clustering is unsupervised learning and needs no labeled data. Based on the characteristics of large-scale short text, we propose two solutions: (1) We select K-means as the clustering algorithm because it is efficient and easy to implement. In addition, the distance between each data point and the cluster centers is computed independently of the other points, so the algorithm can be parallelized.
However, the choice of initial cluster centers has a great influence on the clustering results, and the algorithm easily falls into a local optimum. This thesis therefore combines the particle swarm optimization (PSO) algorithm with the K-means clustering algorithm, because PSO has a global optimization ability that can overcome this disadvantage of K-means. (2) Processing massive data is beyond the computing power of conventional IT infrastructure. To address this problem, we introduce the Hadoop architecture. Hadoop stores massive data in data blocks distributed across different machines and processes them in parallel, which addresses both the memory problem and the running-time problem. Based on these two solutions, we propose a distributed clustering algorithm, DPSOKmeans (Distributed PSO K-means clustering algorithm). Tests of the algorithm running on the Hadoop framework show that it converges well, produces high-quality clusters, and can handle massive data. However, the algorithm still has shortcomings: the choice of initial values still affects the clustering results, and empty clusters may appear. These shortcomings will be addressed in future work.
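
The combination described in the abstract can be illustrated with a minimal single-process sketch. It is not the thesis's DPSOKmeans implementation: all function names (`map_assign`, `reduce_update`, `pso_seed`, `pso_kmeans`) and the PSO parameters (`w`, `c1`, `c2`, particle and iteration counts) are assumptions chosen for illustration, and plain Python functions stand in for the Hadoop map and reduce phases. PSO searches for a good set of initial centers, and K-means then refines them, with the per-point assignment written as an independent map step.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(sq_dist(p, centers[nearest(p, centers)]) for p in points)


def map_assign(point, centers):
    """Map step: each point is processed independently of all others."""
    return nearest(point, centers), point


def reduce_update(pairs, centers):
    """Reduce step: average the points of each cluster.

    An empty cluster keeps its old center (one simple policy, assumed here).
    """
    sums = {}
    for j, p in pairs:
        total, count = sums.get(j, ([0.0] * len(p), 0))
        sums[j] = ([a + b for a, b in zip(total, p)], count + 1)
    return [
        [v / sums[j][1] for v in sums[j][0]] if j in sums else list(centers[j])
        for j in range(len(centers))
    ]


def pso_seed(points, k, n_particles=10, iters=20, w=0.7, c1=1.5, c2=1.5, rng=None):
    """Use PSO to search for k initial centers with a low SSE."""
    rng = rng or random.Random(0)
    dim = len(points[0])
    size = k * dim  # a particle is k candidate centers, flattened

    def unflatten(x):
        return [x[i * dim:(i + 1) * dim] for i in range(k)]

    def fitness(x):
        return sse(points, unflatten(x))

    # Start every particle from k randomly sampled data points.
    pos = [[c for p in rng.sample(points, k) for c in p] for _ in range(n_particles)]
    vel = [[0.0] * size for _ in range(n_particles)]
    pbest = [list(x) for x in pos]
    pbest_f = [fitness(x) for x in pos]
    g = min(range(n_particles), key=pbest_f.__getitem__)
    gbest, gbest_f = list(pbest[g]), pbest_f[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(size):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = list(pos[i]), f
                if f < gbest_f:
                    gbest, gbest_f = list(pos[i]), f
    return unflatten(gbest)


def pso_kmeans(points, k, kmeans_iters=20, seed=0):
    """PSO-seeded K-means: global search first, then local refinement."""
    centers = pso_seed(points, k, rng=random.Random(seed))
    for _ in range(kmeans_iters):
        pairs = [map_assign(p, centers) for p in points]  # parallelizable
        new_centers = reduce_update(pairs, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

In a distributed setting, `map_assign` would run on each data block and `reduce_update` would aggregate the per-cluster sums. The sketch keeps the old center for an empty cluster, which is only one possible way to handle the empty-cluster problem noted in the abstract.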
Keywords/Search Tags: Massive, short text, clustering, K-means, PSO, Hadoop