
Research And Implementation Of Hot Topic Detection On Microblog

Posted on: 2018-07-13
Degree: Master
Type: Thesis
Country: China
Candidate: D H Wei
GTID: 2428330542989922
Subject: Computer technology
Abstract/Summary:
In recent years, microblogging has made considerable progress in technology and quality of service. It has become one of the most popular applications of the 21st century and generates massive amounts of data every day. In the context of big data, mining hot topics from microblogs has become a key research issue. However, microblog posts are short, freely expressed, and sparse in feature words, so traditional text processing and hot topic detection methods cannot be applied to them efficiently. In addition, microblog data, which captures people's views on current affairs, social conditions, living environment, and interpersonal relationships and communication, is large in scale and grows rapidly. Detecting hot topics efficiently among billions of microblog posts therefore has great social value and practical significance.

To improve the efficiency of hot topic detection on massive microblog data, this paper studies distributed hot topic detection methods and techniques and builds a Hadoop-based microblog hot topic detection platform that performs hot topic detection and analysis on massive microblog data. The research work covers three main aspects:

(1) An in-depth study of the open-source distributed framework Hadoop and of the information and propagation characteristics of microblogs. First, this paper designs and implements distributed microblog data acquisition based on Hadoop, which improves the efficiency of data acquisition. Second, the massive microblog data is stored in HBase, which improves the efficiency and flexibility of storage and provides data support for later analysis. On this basis, the raw microblog data is preprocessed in a distributed manner, which improves processing speed and efficiency.

(2) For the massive microblog text data, this paper studies and implements a distributed method of hot topic detection. First, the microblog text is modeled with the LDA (Latent Dirichlet Allocation) topic model, which is effective at dimensionality reduction. This both mitigates the data sparsity problem and improves the space and time efficiency of text similarity calculation. Next, this paper designs and implements a parallel clustering algorithm that combines the Canopy algorithm with the K-means algorithm to detect hot topics. The parallel Canopy generation algorithm first pre-clusters the output of LDA modeling. The parallelized K-means algorithm then performs a second round of clustering to extract the specific topic clusters and detect hot topics. Finally, experiments on the collected microblog data verify that this scheme improves clustering efficiency and detects hot topics in massive microblog data more quickly and effectively.

(3) Building on the work above, this paper constructs a microblog hot topic detection platform based on Hadoop. The platform adopts a layered architecture that integrates a data acquisition and storage layer, a data preprocessing layer, a hot topic detection layer, and a hot topic presentation layer. The modules in different layers are loosely coupled, so the platform is easy to extend.
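
The two-stage clustering in aspect (2) can be illustrated with a minimal single-machine sketch. It is not the thesis's parallel Hadoop implementation: the function names, the Euclidean distance, the thresholds `t1` and `t2`, the sample vectors, and the use of cluster size as a stand-in for "hotness" are all assumptions made for illustration. Each input vector plays the role of a per-post topic distribution such as LDA would produce. Canopy pre-clustering picks the initial centers, and K-means then refines them.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def canopy(points, t1, t2):
    """Canopy pre-clustering with loose threshold t1 and tight threshold t2 (t1 > t2).

    Returns a list of (center, members) pairs. Canopies may overlap.
    """
    assert t1 > t2 >= 0
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within the loose threshold belongs to this canopy.
        members = [p for p in points if distance(center, p) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer become centers.
        remaining = [p for p in remaining if distance(center, p) > t2]
    return canopies


def kmeans(points, centroids, max_iter=20):
    """Plain K-means, seeded with the given initial centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Made-up topic distributions over three topics, one tuple per post.
docs = [
    (0.90, 0.05, 0.05),
    (0.80, 0.10, 0.10),
    (0.10, 0.80, 0.10),
    (0.05, 0.90, 0.05),
    (0.10, 0.10, 0.80),
]

canopies = canopy(docs, t1=0.6, t2=0.3)
seeds = [center for center, _ in canopies]
centroids, clusters = kmeans(docs, seeds)

# Rank clusters by size as a crude proxy for topic hotness (an assumption).
for cluster in sorted(clusters, key=len, reverse=True):
    print(len(cluster), cluster)
```

Seeding K-means from the canopy centers means the number of clusters does not have to be chosen in advance, which is the usual motivation for pairing the two algorithms.
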
Keywords/Search Tags:microblog, hot topic detection, LDA topic model, clustering