
Micro-blog Hot Topics Detection Method Based On Hybrid Clustering

Posted on: 2018-02-27 | Degree: Master | Type: Thesis
Country: China | Candidate: Y N Zhang | Full Text: PDF
GTID: 2348330515966799 | Subject: Computer Science and Technology
Abstract/Summary:
Network media emerged with the rapid development of Internet technology. Micro-blog, with its real-time information release, platform diversity and grassroots content, replaced traditional media within a few years and has become the most widely used tool in people's social life. Micro-blog platforms produce tens of millions or even billions of records every day. Digging hidden hot topics out of this massive data, grasping the dynamics of public opinion in a timely manner, and obtaining the information that is actually needed are becoming more and more important.

Topic detection for traditional long text is a mature technology. However, micro-blog text is much shorter than a traditional news report: it is limited to 140 characters. Applying traditional topic detection methods to micro-blog text leads to sparse features, loss of semantic information and other problems. Therefore, this thesis models the text with the LDA (Latent Dirichlet Allocation) model, which represents text better. The resulting text vectors are then clustered by an improved hybrid clustering algorithm, and finally the hot topics are found by sorting the clusters by heat.

First, according to the characteristics of micro-blog text, the thesis preprocesses the data, including data filtering, Chinese word segmentation, stop-word removal and so on. Then, after comparing the traditional VSM (vector space model) modeling method with the LDA modeling method, the LDA model, which gives the better modeling results, is selected to model the text, and the resulting "document-topic" matrix is used as the feature vector of the micro-blog text. LDA modeling alleviates the problems of data sparseness and semantic loss and effectively reduces the data dimension.

Next, in view of the respective advantages and disadvantages of hierarchical clustering and partitional clustering, the thesis improves each of them. The improved algorithm uses the idea of minimum spanning tree partitioning to sort the distances between the data points, which avoids recalculating distances every time the hierarchical method merges clusters. The centers of the resulting initial clusters are then used as the initial cluster centers of the K-means algorithm, which overcomes the difficulty of choosing the initial points for K-means. K-means is then run to re-cluster the data and correct the result of the hierarchical clustering.

Finally, the thesis proposes a formula for calculating topic heat and sorts the clustering results by the calculated heat to obtain the final hot topics. The experimental results show that the micro-blog hot topic detection method based on the hybrid clustering algorithm performs better than the single clustering algorithms it was compared with. The final hot topics basically reflect the hotspots of the day.
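
The hybrid clustering step described above can be illustrated with a small sketch. It is not the thesis's implementation: it assumes that "minimum spanning tree partitioning" means a Kruskal-style single-linkage grouping (all pairwise distances are sorted once and the closest pairs are merged until k groups remain), and it assumes Euclidean distance on the document-topic vectors. The function names `mst_partition`, `centroid`, `kmeans` and `hybrid_cluster` are hypothetical.

```python
import math
from itertools import combinations


def mst_partition(points, k):
    """Assumed hierarchical step: sort every pairwise distance once, then
    merge the closest pairs (Kruskal + union-find) until k groups remain."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    components = len(points)
    for _, i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1

    groups = {}
    for idx in range(len(points)):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())


def centroid(points, members):
    """Mean vector of the points whose indices are listed in members."""
    dim = len(points[0])
    return tuple(
        sum(points[m][d] for m in members) / len(members) for d in range(dim)
    )


def kmeans(points, centers, max_iter=100):
    """Plain K-means that starts from the given centers instead of random ones."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(points, members)
    return labels, centers


def hybrid_cluster(points, k):
    """Initial grouping -> group centers -> K-means re-clustering."""
    initial_groups = mst_partition(points, k)
    initial_centers = [centroid(points, g) for g in initial_groups]
    return kmeans(points, initial_centers)


# Toy document-topic vectors (made-up values, two topics per document).
doc_topic = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8), (0.85, 0.15)]
labels, centers = hybrid_cluster(doc_topic, k=2)
```

Because the K-means pass starts from the centers of the initial groups, it needs no random initialization, and it can reassign points that the single-linkage pass grouped poorly. The heat formula is not given in the abstract, so the sorting step is not sketched here.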
Keywords/Search Tags: micro-blog, hot topic, vector space model, topic model, hierarchical clustering algorithm, K-means algorithm