
Research On Hot Topics Discovery In Microblog Based On Distributed K-means Algorithms

Posted on: 2020-07-30
Degree: Master
Type: Thesis
Country: China
Candidate: J M Xu
Full Text: PDF
GTID: 2428330611453219
Subject: Electronic and communication engineering
Abstract/Summary:
With the rapid development of Internet technology, more and more online media have entered our daily lives. With its diverse content and real-time publishing, microblogging has become an important communication and information platform, and a large amount of new microblog data appears every day. Finding potential hot topics among these tens of billions of pieces of information is especially important for users who want to follow social hotspots, for government agencies that monitor public opinion, and for business managers who need to make relevant decisions. Against this background, this thesis studies hot topic discovery in microblogs. The main work is as follows:

1. The dissertation first summarizes the key technologies used in microblog hot topic discovery. It then briefly describes Chinese word segmentation, text representation models, text similarity calculation, and commonly used text clustering algorithms, and points out the advantages and disadvantages of the various clustering algorithms.

2. A microblog hot topic discovery system is designed. First, the microblog API is used to collect data, and the collected data is filtered and cleaned. Second, the filtered microblog text is preprocessed: the NLPIR word segmentation system performs Chinese word segmentation, and an integrated stop word list is used to remove stop words. The text is then vectorized, the TF-IDF method is used to calculate and rank the weight of each word in the text, and keywords are extracted. Finally, the improved K-means algorithm clusters the keywords, a microblog heat formula is applied to the clustering results, and the results are ranked by heat to obtain a list of microblog hot topics.

3. To address the dependence of the K-means clustering algorithm on the choice of initial cluster centers, a distributed hybrid clustering algorithm based on Hadoop is proposed. First, the Canopy algorithm clusters the vectorized text data, dividing the microblog data set into K initial classes. Second, the K obtained from this step is used to set the initial cluster centers of the K-means algorithm, which then performs a second round of clustering.

Experimental analysis of the crawled microblog data shows that the improved K-means algorithm outperforms the single clustering algorithm in recall, precision, and F1 score, and can effectively discover microblog hot topics.
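
The hybrid idea in point 3 can be illustrated with a minimal single-machine sketch. It is not the thesis's Hadoop implementation: the function names (canopy, kmeans, canopy_kmeans), the Euclidean distance, the thresholds t1 and t2, and the toy 2-D points are all assumptions made for illustration. A Canopy pass picks coarse centers, and those centers then seed K-means so that K and the initial centers no longer have to be chosen by hand.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2):
    """Canopy pass with loose threshold t1 and tight threshold t2 (t1 > t2).

    Returns a list of (center, members) pairs. Points within t2 of a center
    can no longer become centers; points within t1 are canopy members.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [p for p in points if euclidean(center, p) <= t1]
        canopies.append((center, members))
        remaining = [p for p in remaining if euclidean(center, p) > t2]
    return canopies


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iterations starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster ended up empty.
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


def canopy_kmeans(points, t1, t2):
    """Seed K-means with the canopy centers; K is the number of canopies."""
    seeds = [center for center, _ in canopy(points, t1, t2)]
    return kmeans(points, seeds)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for TF-IDF document vectors.
    data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
            (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
    final_centers, final_clusters = canopy_kmeans(data, t1=3.0, t2=1.5)
    for c, members in zip(final_centers, final_clusters):
        print(c, members)
```

In a distributed setting, the distance computations and center updates would presumably be split across map and reduce tasks; this sketch only shows the seeding logic. Text clustering often uses cosine rather than Euclidean distance, so the metric here should be read as a placeholder.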
Keywords/Search Tags:Hadoop, Clustering, K-means algorithm, Canopy algorithm, Hot topic