
Research And Implementation Of Micro-blog Clustering Algorithm Based On Storm

Posted on: 2019-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Wang
GTID: 2348330542491658
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of the Internet, social media such as micro-blogs have become increasingly popular, and micro-blog users generate a large amount of data every day. Analyzing these data effectively and extracting useful information has great commercial and social value. For short micro-blog texts, the traditional text vectorization model does not consider the semantic associations behind keywords, which lowers the accuracy of subsequent clustering analysis. Classical clustering algorithms also have shortcomings on short micro-blog text: for example, the random selection of initial cluster centers in the K-means algorithm can make the clustering results unstable and prone to falling into a local optimum. In addition, traditional clustering algorithms are inefficient when processing massive micro-blog data. This thesis studies these problems; its main work and innovations are as follows:

(1) After denoising and segmenting the micro-blog text and deleting stop words, we use the LDA topic model instead of the vector space model and combine it with the K-means algorithm to cluster micro-blog users. Experiments show that the proposed scheme improves on the vector space model combined with K-means in performance indices such as clustering accuracy.

(2) Building on the above experiment and addressing the deficiencies of the K-means algorithm, we propose a scheme for selecting the initial cluster centers based on the data distribution, which improves the stability of clustering and avoids the clustering results falling into a local optimum. We also propose a weighted Euclidean distance based on information entropy, which enlarges or reduces distances according to how much the attributes of the data objects differ, so that the distance reflects the different roles that different attributes play during clustering. Experiments further verify the feasibility of these improvements.

(3) To address the inefficiency of the algorithm on massive micro-blog data in a single-machine environment, this thesis studies the Storm stream processing platform in depth and builds a distributed cluster environment based on it. The Kafka message queue system is also introduced so that Storm can act as a consumer and read data from Kafka message queues in parallel. The improved K-means algorithm is then implemented in parallel on the Storm-based stream processing platform. Experimental results show that the processing efficiency of the parallelized algorithm is greatly improved in the Storm cluster environment.
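The abstract does not give the formula for the entropy-weighted Euclidean distance, so the following is only a minimal sketch of one common formulation (the entropy weight method), not the thesis's actual scheme. It assumes non-negative attribute values, such as the document-topic proportions produced by LDA. The function names `entropy_weights` and `weighted_euclidean` are hypothetical and chosen for illustration.

```python
import math


def entropy_weights(rows):
    """Return one weight per attribute via the entropy weight method.

    Assumption (not from the thesis): attributes are non-negative, and an
    attribute whose values are spread unevenly across objects (low entropy)
    is treated as more discriminative and receives a larger weight.
    """
    n = len(rows)
    m = len(rows[0])
    diversities = []
    for j in range(m):
        column = [row[j] for row in rows]
        total = sum(column)
        if total == 0 or n < 2:
            # A constant-zero column (or a single object) carries no information.
            diversities.append(0.0)
            continue
        # Shannon entropy of the column's proportions, normalised by ln(n)
        # so that it lies in [0, 1].
        entropy = 0.0
        for value in column:
            p = value / total
            if p > 0:
                entropy -= p * math.log(p)
        entropy /= math.log(n)
        diversities.append(1.0 - entropy)
    total_diversity = sum(diversities)
    if total_diversity == 0:
        # Every attribute is uninformative: fall back to equal weights.
        return [1.0 / m] * m
    return [d / total_diversity for d in diversities]


def weighted_euclidean(x, y, weights):
    """Euclidean distance with each squared difference scaled by its weight."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


# Two objects with two attributes: the first attribute is identical for both
# objects, the second differs strongly.
data = [[0.5, 0.9], [0.5, 0.1]]
weights = entropy_weights(data)
distance = weighted_euclidean(data[0], data[1], weights)
```

In this sketch the weights sum to 1, so an attribute that separates the objects well stretches the distance along its axis while a near-uniform attribute shrinks it. In a K-means loop, `weighted_euclidean` would replace the plain Euclidean distance in the assignment step.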
Keywords/Search Tags: LDA Topic Model, K-means Algorithm, Information Entropy, Storm System, Kafka System