Research And Implementation Of Hot Topic Discovery On Microblog

Posted on: 2015-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Ding
GTID: 2298330431493897
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet and the widespread adoption of the mobile Internet, users have more and more diverse ways to communicate with each other. As a new platform, microblog has been accepted by users and has become popular because of its flexibility and convenience. Microblog brings great convenience to people's lives, but it also has side effects: for example, some people deliberately use microblog to spread false news, which harms social stability. If such topics can be found early, appropriate measures can be taken promptly. Moreover, microblog users easily become confined to local information, in which case they cannot know what other users across the whole microblog network are concerned about and discussing. Therefore, it is meaningful to discover hot topics in a timely manner from the large volume of microblog information.

The paper defines the heat of a topic in order to express a hot topic quantitatively. The later the release dates of the microblog posts a topic contains, and the larger their numbers of comments and forwards, the higher the heat of the topic and the more likely it is a hot topic. Many researchers in China and abroad have studied hot topic discovery. Their methods can be roughly summarized as clustering algorithms, the LDA model, emotion models, or improvements on these basic methods. To study microblog hot topic discovery, the paper first needs to solve the problem of obtaining a microblog corpus. A traditional web crawler is not suitable for crawling microblog information, and the microblog API can only retrieve the microblog information shown on one's own microblog home page. In order to crawl a large amount of microblog information, the paper first crawls a large number of microblog users based on the relationships between users, and then grabs the latest microblog posts they have published.
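The user-collection step described here can be pictured as a breadth-first walk over the user relationship graph. The following is only a minimal sketch under stated assumptions, not the crawler used in the thesis: `collect_users`, the `get_related` callback, and the toy graph are all hypothetical, and no real microblog API is called.

```python
from collections import deque


def collect_users(seed, get_related, max_users):
    """Breadth-first walk over user relationships starting from `seed`.

    `get_related` is a hypothetical callback that returns the user IDs
    linked to a given user (for example followers or followees). It stands
    in for whatever lookup a real crawler would perform.
    """
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_users:
        user = queue.popleft()
        order.append(user)
        for other in get_related(user):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return order


# Toy relationship graph (made-up data) used in place of a live service.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}

users = collect_users("a", lambda u: toy_graph.get(u, []), max_users=4)
print(users)
```

Once a user list like this is available, a crawler would fetch the latest posts of each collected user; that fetch is deliberately left out because its details depend on the service being crawled.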
Next, a number of preprocessing operations are needed, including spam filtering, word segmentation, stop word removal, removal of useless information, feature word extraction, and feature weight calculation. The result of preprocessing is a feature vector for every microblog text. Finally, because microblog information grows incrementally, the Single-Pass incremental clustering algorithm is selected to obtain multiple clusters. Each cluster represents a topic that contains many microblog posts. In order to select the hot topics from the topics found, the paper defines the heat of a topic: the more recent the posts and the more comments and forwards a topic has, the higher its heat and the greater its likelihood of being a hot topic.

Many scholars have found that the LDA topic model can also be used to find topics, but it requires many iterations and therefore runs for a long time when handling large amounts of data. However, the LDA topic model is good at expressing themes, so the Single-Pass algorithm and the LDA model can be combined. The specific approach is: first, use the Single-Pass clustering algorithm to process the microblog information; then, use the LDA algorithm to process each cluster; finally, obtain the microblog hot topics. Single-Pass and LDA are both topic discovery algorithms, and each can be used alone to find hot topics. Combining Single-Pass with LDA yields more accurate topics than Single-Pass alone and is faster than LDA alone.
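The clustering and ranking steps can be illustrated with a small sketch. It rests on several assumptions that the abstract does not specify: feature vectors are raw term counts rather than the thesis's feature weights, similarity is cosine similarity against a summed cluster centroid, and the heat formula (an exponential recency decay multiplied by comment and forward counts) is a hypothetical stand-in for the paper's definition. The per-cluster LDA refinement is omitted, and all data below is made up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(posts, threshold):
    """Assign each post to its most similar cluster, or open a new one."""
    clusters = []
    for post in posts:
        vec = Counter(post["terms"])
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # summed counts keep the direction
            best["posts"].append(post)
        else:
            clusters.append({"centroid": Counter(vec), "posts": [post]})
    return clusters


def heat(cluster, now, half_life=24.0):
    """Hypothetical heat score: newer, more discussed posts count more."""
    total = 0.0
    for p in cluster["posts"]:
        decay = 0.5 ** ((now - p["hour"]) / half_life)
        total += decay * (1.0 + p["comments"] + p["forwards"])
    return total


# Made-up, already preprocessed posts; "hour" is the publication time.
posts = [
    {"terms": ["quake", "city", "rescue"], "hour": 0, "comments": 2, "forwards": 1},
    {"terms": ["match", "goal", "final"], "hour": 10, "comments": 5, "forwards": 5},
    {"terms": ["quake", "rescue", "team"], "hour": 20, "comments": 30, "forwards": 40},
    {"terms": ["goal", "final", "fans"], "hour": 22, "comments": 1, "forwards": 0},
]

clusters = single_pass(posts, threshold=0.5)
ranked = sorted(clusters, key=lambda c: heat(c, now=24), reverse=True)
for c in ranked:
    print(sorted(c["centroid"]), round(heat(c, now=24), 2))
```

Because each post is compared only against the existing clusters and never revisited, new posts can be added as they arrive, which is what makes the approach suited to incrementally growing data. The result does depend on the order of the posts and on the chosen threshold.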
Keywords/Search Tags: microblog, hot topics discovery, microblog API, Single-Pass algorithm, LDA model