Research On Hot Topic Detection Technology Of Weibo Based On Word2vec

Posted on: 2020-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: H H Ju
GTID: 2438330602959786
Subject: Control engineering
Abstract/Summary:
With the rapid development of Web 2.0 and the spread of mobile devices, microblogging (Weibo for short) has gradually become an important way for people to communicate and to learn about events from around the world. More and more netizens express their feelings and opinions about events through Weibo, which gives rise to a form of online public opinion distinct from that of traditional news media. Mining useful information from Weibo texts and extracting hot topics is therefore important for the timely discovery of online public opinion. Because microblog posts contain few words and their context is often inconsistent, text modeling suffers from severe data sparseness, which reduces the accuracy of topic detection. Research on hot topic detection for microblog short text is therefore necessary. This thesis studies microblog short-text modeling and topic detection. The main work is as follows.

(1) Obtain and preprocess microblog short text. To obtain more topical Weibo posts, a web crawler is first used to crawl the posts of influential, well-known verified user accounts. Compared with collecting data through the interface provided by the official platform, the crawler is more convenient and can obtain more data. Posts with too little content are then filtered out to reduce redundancy. Finally, the Jieba word segmentation tool is used to segment the short text, and stop words are removed.

(2) Improve the topic modeling method. To address data sparseness and the difficulty of expanding with an external corpus when modeling microblog short text, the thesis proposes feeding feature words into Word2vec's skip-gram model to train word vectors, finding words similar to the feature words, and adding them to the short texts. Latent Dirichlet Allocation (LDA) is then used to model the expanded text and extract topics. This method addresses the severe sparseness problem that LDA faces on short text while retaining LDA's ability to capture semantic relationships, and it ultimately improves the accuracy of the microblog short-text model.

(3) Improve the hot topic detection algorithm. To address the high computational cost of the traditional Single-Pass clustering algorithm, an improved Single-Pass clustering algorithm is proposed. A centroid vector is selected as the topic center in the document-topic matrix output by the LDA model, so each input text only needs to be compared with the topic centers for similarity. This effectively improves clustering speed. The improved Single-Pass clustering is applied to the document-topic matrix to obtain preliminary hot topics, and hierarchical agglomerative clustering (HAC) is then used to merge topics, yielding hot topics with higher cohesion.

Through the work above, this thesis completes a study of hot topic detection for microblog short text based on Word2vec, describing in detail its three parts: microblog data collection, text modeling, and topic detection. For the short-text data sparseness problem, an LDA algorithm based on Word2vec-expanded short text is proposed, and improved Single-Pass clustering combined with HAC is adopted to consolidate topics. Experimental results show that, compared with the traditional hot topic detection algorithm, the proposed algorithm effectively improves the extraction accuracy of hot topics in microblog short text.
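
The short-text expansion described in part (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes word vectors are already available (in practice they might come from a skip-gram trainer such as gensim's `Word2Vec` with `sg=1`), and the toy vectors, the function name `expand_text`, and the `topn` and `min_sim` parameters are all hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_text(tokens, vectors, topn=2, min_sim=0.8):
    """Append to a short text the words most similar to its tokens.

    tokens  : segmented short text (list of words)
    vectors : dict word -> vector, assumed to be pre-trained word vectors
    topn    : hypothetical cap on added words per token
    min_sim : hypothetical similarity cutoff for an added word
    """
    expanded = list(tokens)
    seen = set(tokens)
    for tok in tokens:
        if tok not in vectors:
            continue  # no vector for this token, nothing to expand
        candidates = sorted(
            ((cosine(vectors[tok], vec), word)
             for word, vec in vectors.items() if word not in seen),
            reverse=True,
        )
        for sim, word in candidates[:topn]:
            if sim >= min_sim:
                expanded.append(word)
                seen.add(word)
    return expanded

# Toy vectors standing in for skip-gram output (made up for illustration).
toy_vectors = {
    "earthquake": [0.90, 0.10, 0.00],
    "quake":      [0.88, 0.12, 0.02],
    "rescue":     [0.70, 0.30, 0.10],
    "concert":    [0.00, 0.20, 0.90],
    "singer":     [0.05, 0.15, 0.92],
}

expanded = expand_text(["earthquake"], toy_vectors, topn=1, min_sim=0.9)
```

The expanded token lists would then be passed to an LDA implementation in place of the original short texts; that step is omitted here.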
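
The centroid-based Single-Pass step described in part (3) can be sketched as follows. This is a hedged illustration only: the abstract does not specify the similarity measure, the threshold, or how the centroid is updated, so cosine similarity, the `threshold` value, and the running-mean update are assumptions, and the toy document-topic rows are made up. The subsequent HAC merge is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def single_pass(doc_topic_rows, threshold=0.8):
    """Single-Pass clustering that compares each document only with
    cluster centroids rather than with every earlier document.

    doc_topic_rows : rows of a document-topic matrix (e.g. LDA output)
    threshold      : assumed similarity cutoff for joining a cluster
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [row indices]}
    for idx, row in enumerate(doc_topic_rows):
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(row, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Join the closest cluster and update its centroid as a running mean.
            n = len(best["members"])
            best["centroid"] = [
                (c * n + x) / (n + 1) for c, x in zip(best["centroid"], row)
            ]
            best["members"].append(idx)
        else:
            # No cluster is similar enough, so this document starts a new topic.
            clusters.append({"centroid": list(row), "members": [idx]})
    return clusters

# Toy document-topic rows (made up for illustration).
rows = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
    [0.80, 0.10, 0.10],
]
members = [c["members"] for c in single_pass(rows, threshold=0.8)]
```

Comparing against one centroid per cluster keeps the cost per document proportional to the number of clusters instead of the number of documents seen so far, which is the speed-up the abstract attributes to the improved algorithm.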
Keywords/Search Tags: Hot topic detection, Word2vec, LDA, Single-Pass, HAC