Hot Topic Extraction And Tracking Based On Chinese Microblog

Posted on: 2018-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Ye
Full Text: PDF
GTID: 2348330518473189
Subject: Software engineering
Abstract/Summary:
Since its birth, the microblog, with its broad participation, has changed the way people get news and keep in touch with current events. In recent years, much breaking news and many hot topics have first been released on microblogs, and traditional media cannot match their speed and scope of spread. Vast numbers of microblog posts are published every minute and every second. These data cover every aspect of daily life and contain a great deal of valuable topic information. Extracting these topics correctly would help us better understand the latest public opinion. However, the volume of data is too large, by orders of magnitude, to handle manually. Meanwhile, microblog posts are short and sparse, so traditional topic extraction and tracking methods cannot be applied directly. This thesis studies hot topic extraction and tracking on microblogs in depth. The main contributions are as follows:

1. This thesis proposes an improved topic extraction model named MF-LDA (Microblog Features Latent Dirichlet Allocation) to extract hot topics from microblog posts. Unlike the traditional LDA (Latent Dirichlet Allocation) model, MF-LDA incorporates five features unique to microblogs to better extract hot topics: support, comment, retweet, publish time and user authority. The first three features are used to compute the attention value of each post. A user's authority value is computed from their followers (fans) and the accounts they follow (idols). The publish times of all posts are then divided into time slices, and the word frequency in each time slice is calculated. A feature vector is built from these calculations and added to MF-LDA. The optimal parameters of the model are obtained by training with Gibbs sampling. Finally, the probability of each word is obtained: the higher the probability, the more likely the word is to be a hot topic word.

2. To track the evolution of hot topics, this thesis considers two aspects: topic structure and topic content. A hot topic life cycle model named HTLCM (Hot Topic Life Cycle Model) is constructed to track changes in topic structure. The life cycle of a topic is divided into five stages: birth, growth, maturity, decline and disappearance. By computing the topic's volume, its growth rate and the rise in that rate per unit time, HTLCM estimates which stage a topic is in and determines whether the topic is a candidate hot topic, which gives a picture of the topic's overall development. To track changes in topic content, this thesis proposes an algorithm called HTT (Hot Topic Tracking), which integrates MF-LDA and HTLCM. HTT first assigns the candidate hot topics labeled by HTLCM to the corresponding time windows according to their release times. The data of each time window are then input into MF-LDA to obtain the hot topics in that window together with their most relevant keywords. Changes in the content of a topic can be tracked by analyzing how these keywords change.

To validate the proposed models and algorithm, this thesis reports experiments on real data sets. The results show that, under the same conditions, MF-LDA has lower perplexity than LDA and a higher coverage rate. Meanwhile, the HTLCM model and the HTT algorithm can not only keep track of hot topics but also find potential hot topics. According to the experimental results, the models and methods proposed in this thesis are effective and of practical significance for the extraction and tracking of hot topics.
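The abstract does not give the formulas or thresholds that HTLCM uses, so the following is only a minimal Python sketch of how a life-cycle stage might be estimated from the three quantities it names. It assumes that the "amount of topic" is the number of posts on the topic per unit time, that the growth rate is the relative change in that count, and that the "rise of rate" is the change in growth rate between consecutive units. The function names, the thresholds `min_count` and `flat_rate`, and the candidate rule are all hypothetical and are not taken from the thesis.

```python
from typing import List, Tuple


def topic_features(counts: List[int]) -> Tuple[int, float, float]:
    """Return (count, growth rate, change in growth rate) for the latest unit.

    counts: posts on one topic per unit time, oldest first (assumed input).
    """
    if len(counts) < 3:
        raise ValueError("need at least three time units")
    prev2, prev, cur = counts[-3], counts[-2], counts[-1]
    # Relative change over the last unit; max(..., 1) avoids division by zero.
    rate = (cur - prev) / max(prev, 1)
    prev_rate = (prev - prev2) / max(prev2, 1)
    return cur, rate, rate - prev_rate


def classify_stage(counts: List[int], min_count: int = 5,
                   flat_rate: float = 0.1) -> str:
    """Map the latest time unit to one of the five stages (hypothetical rules)."""
    cur, rate, _ = topic_features(counts)
    if cur < min_count:
        # Low volume: rising means the topic is emerging, otherwise fading out.
        return "birth" if rate > 0 else "disappearance"
    if rate > flat_rate:
        return "growth"
    if rate < -flat_rate:
        return "decline"
    return "maturity"


def is_candidate_hot_topic(counts: List[int]) -> bool:
    """Hypothetical rule: a growing topic whose growth is not slowing down."""
    _, _, accel = topic_features(counts)
    return classify_stage(counts) == "growth" and accel >= 0


if __name__ == "__main__":
    for series in ([1, 2, 4], [10, 20, 50], [100, 100, 102],
                   [100, 80, 50], [20, 6, 2]):
        print(series, classify_stage(series), is_candidate_hot_topic(series))
```

In a pipeline like the one the abstract describes, only the topics flagged as candidates would be assigned to time windows and passed on to the topic model. The actual stage boundaries would have to come from the thesis itself.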
Keywords/Search Tags:Microblog, Hot topic extraction and tracking, Microblog features, MF-LDA, HTLCM, HTT