LDA Model Based Micro-blog Topic and Event Detection

Posted on: 2015-09-22
Degree: Master
Type: Thesis
Country: China
Candidate: N Wu
Full Text: PDF
GTID: 2308330479489713
Subject: Computer Science and Technology
Abstract/Summary:
We live in the Internet era. In recent years, new social networking tools such as SNS, Micro-blog and WeChat have risen rapidly, each with a huge user base. Micro-blog posts are highly real-time, concise (within 140 characters) and can be published in diverse ways, so Micro-blog has become a main platform for Internet information. It gathers huge amounts of text data within a short period of time. Processing this complex, disordered Micro-blog text and quickly extracting refined, valuable topics is a major challenge for topic detection technology.

An LDA-SP (Latent Dirichlet Allocation-Single Pass) algorithm for Micro-blog topic detection is proposed. First, the basic process of topic detection is introduced, including the principles and implementation details of each technique involved. Traditional topic detection represents text with the Vector Space Model, which suffers from high dimensionality and a lack of semantic expressiveness. The paper therefore presents an improved algorithm that models Micro-blog texts with Latent Dirichlet Allocation. After analyzing and comparing several commonly used clustering methods, it selects the Single-Pass clustering algorithm as the concrete clustering method for topic detection. Experimental results show that the proposed algorithm alleviates the problem of missing topics while preserving the accuracy of topic detection.

A method for computing Micro-blog event similarity based on similar content is also proposed. Topic models alone do not handle the "difficult to distinguish" problem well. The paper first considers two micro-blogs that are semantically similar, then calculates their identity score from time and place features, and from that score infers whether they express the same event. The Single-Pass algorithm is improved by using this method. Finally, experiments are run on a specific data set. Results show that, compared with previous methods, the proposed one handles the "difficult to distinguish" problem better.
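
The Single-Pass clustering step named in the abstract can be sketched as follows. This is only a minimal illustration and not the thesis's implementation: it assumes each micro-blog has already been reduced to a topic-distribution vector (for example by an LDA model), uses cosine similarity against a running-mean centroid, and the names `cosine`, `single_pass` and the `threshold` value are hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass(vectors, threshold=0.8):
    """Assign each vector, in arrival order, to the most similar existing
    cluster if the similarity reaches `threshold`; otherwise open a new cluster.
    Returns a list of clusters, each a list of input indices."""
    centroids = []  # running mean vector of each cluster
    members = []    # indices of the vectors in each cluster
    for idx, vec in enumerate(vectors):
        best, best_sim = -1, -1.0
        for c, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        if best >= 0 and best_sim >= threshold:
            members[best].append(idx)
            n = len(members[best])
            # Update the centroid incrementally as the mean of its members.
            centroids[best] = [(m * (n - 1) + x) / n
                               for m, x in zip(centroids[best], vec)]
        else:
            centroids.append(list(vec))
            members.append([idx])
    return members


# Hypothetical topic distributions for four posts over three topics.
posts = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
]
clusters = single_pass(posts, threshold=0.8)
print(clusters)
```

Because each post is compared only with the clusters seen so far, the data is read once, which suits a stream of posts; the result does depend on arrival order and on the chosen threshold. The time-and-place identity score described in the abstract could be applied as an extra check before a post joins a cluster, but its exact form is not given here, so it is left out of the sketch.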
Keywords/Search Tags: micro-blog, topic detection, LDA model, text clustering