
Event Detection From Microblogs Based On Topic Model

Posted on: 2016-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: J X Wang
Full Text: PDF
GTID: 2298330467491802
Subject: Computer Science and Technology

Abstract/Summary:
As the number of microblog users grows, Twitter and Weibo have become important information platforms for media and individuals. Microblog posts are usually short (fewer than 140 characters) and carry a wealth of social information, but they also contain redundant text that says nothing about events, so traditional text mining algorithms perform poorly at extracting events from microblogs. In this thesis, we combine Chinese POS (part-of-speech) tagging with the LDA (Latent Dirichlet Allocation) topic model to extract events from microblogs, and use an incremental clustering algorithm to determine the number of events. The results show that POS tagging can filter out uninformative content, that the LDA model can represent text in a low-dimensional topic space, and that the incremental clustering algorithm can find the number of events. Experiments also show that combining POS tagging, the LDA model and the incremental algorithm improves the accuracy of topic extraction from microblogs.

I did the following tasks in this thesis:

First, I analyzed traditional text models, in particular the vector space model, latent semantic analysis and probabilistic latent semantic analysis. The analysis showed that these models have many weaknesses on microblog text, so I proposed modeling microblog text with the LDA topic model.

Then, in the process of microblog event detection, I found that preprocessing has a large effect on event extraction, because microblog text contains a lot of content unrelated to any event. Filtering out that content improves the accuracy of event extraction, so I proposed using Chinese POS tagging as the filter. Experiments showed that Chinese POS tagging improves the accuracy of topic extraction from microblogs.

Finally, I analyzed traditional text clustering algorithms. Algorithms such as K-means require the number of events to be specified in advance, which is difficult for microblog text. The incremental single-pass algorithm does not need the number to be predefined, and I also improved the single-pass algorithm. Experiments showed that the improved single-pass algorithm can find events in microblog text effectively.
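
To illustrate why single-pass clustering does not need the number of events in advance, here is a minimal sketch of the basic (unimproved) algorithm. The abstract does not describe the thesis's improvements or its exact document representation, so this is an assumption-laden illustration: it uses raw term counts and cosine similarity instead of LDA topic vectors, and the function names and the `threshold` value are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(docs, threshold=0.3):
    """Cluster tokenized posts in one pass over the data.

    Each post joins the most similar existing cluster if the similarity
    reaches `threshold` (a hypothetical value); otherwise it opens a new
    cluster. The number of clusters is therefore an output, not an input.
    Returns (labels, centroids).
    """
    centroids = []  # summed term counts per cluster (scale does not affect cosine)
    labels = []
    for tokens in docs:
        vec = Counter(tokens)
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            centroids[best].update(vec)  # fold the post into the cluster
            labels.append(best)
        else:
            centroids.append(Counter(vec))  # start a new event cluster
            labels.append(len(centroids) - 1)
    return labels, centroids


# Toy input: already tokenized and filtered posts (made-up example data).
posts = [
    ["earthquake", "sichuan", "rescue"],
    ["earthquake", "rescue", "team"],
    ["concert", "beijing", "tickets"],
    ["concert", "tickets", "sale"],
]
labels, centroids = single_pass(posts, threshold=0.3)
print(labels, len(centroids))
```

In a pipeline like the one the abstract describes, the token lists would come from POS-filtered text and the vectors would be LDA topic distributions; only the vector construction would change, not the clustering loop. The result depends on the order of the posts and on the threshold, which is a known property of single-pass methods.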
Keywords/Search Tags: topic model, topic detection, part-of-speech tagging, short text, single-pass clustering