The Research And Implementation Of Topic Evolution Based On LDA

Posted on: 2011-09-22, Degree: Master, Type: Thesis
Country: China, Candidate: K Cui
GTID: 2178360308485689, Subject: Computer Science and Technology
Abstract/Summary:
With its rapid development, the Internet has become the main platform on which people express their ideas and opinions and, as a consequence, one of the most powerful driving forces of public opinion. It is therefore increasingly important to understand what is happening on the Web before acting on it. To follow trends in public opinion across massive textual data, topic detection, tracking, and forecasting have attracted considerable academic interest, and topic evolution is among the most important of these research problems. This thesis focuses on topic evolution models for text streams. It studies methods for topic detection and evolution, building on an analysis of topic detection, tracking, and text mining techniques, and discusses methods that help people understand the topic evolution patterns behind the texts and discover new topics. The main contributions of this thesis are summarized as follows.
(1) First, the thesis surveys research on topic detection, tracking, and text mining techniques, and analyzes the limitations of short text data such as microblogs and BBS posts. Then, based on the definition and model of topic evolution, it adopts the probabilistic topic model as its basic modeling tool.
(2) Second, Latent Dirichlet Allocation (LDA) is extended to online text streams, and an online LDA model is proposed and implemented. The main idea is to use the posterior topic-word distribution of each time slice to influence inference in the next time slice, which also preserves the continuity between topics. The topic-word and document-topic distributions are inferred with an incremental Gibbs sampling algorithm. Kullback-Leibler (KL) divergence (relative entropy) is used to measure the similarity between topics in order to identify topic inheritance and topic mutation.
(3) Third, a new algorithm, iOLDA, is proposed to support an interactive process of topic modeling and discovery. To suppress topics that are uninteresting or irrelevant, it lets users add supervision by adjusting the posterior topic-word distributions at the end of each iteration, which in turn influences inference in the next iteration. Experiments on both English and Chinese corpora show that the extracted topics capture meaningful themes in the data and that the proposed interaction policies help discover better topics.
(4) Finally, the work described above is implemented on the UIMA platform and integrated into the YHPods platform. Experiments on real data validate the effectiveness of this work.
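The KL-based topic comparison mentioned in contribution (2) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `kl_divergence` and `link_topics`, the smoothing constant `eps`, the direction KL(current || previous), and the `threshold` value are all assumptions made for illustration. Each topic is represented as a word distribution over a shared vocabulary; a topic in the current time slice is labeled "inherited" when its closest topic in the previous slice lies within the threshold, and "mutated" otherwise.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    # Smoothing with eps (an assumption) avoids log(0) when a word has zero mass.
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    zp, zq = sum(p), sum(q)
    return sum((a / zp) * math.log((a / zp) / (b / zq)) for a, b in zip(p, q))


def link_topics(prev_topics, curr_topics, threshold=0.5):
    """Match each current topic to its nearest previous topic by KL divergence.

    Returns (current index, previous index, label) tuples. The threshold is a
    hypothetical cut-off, not a value taken from the thesis.
    """
    links = []
    for k, curr in enumerate(curr_topics):
        divs = [kl_divergence(curr, prev) for prev in prev_topics]
        best = min(range(len(divs)), key=divs.__getitem__)
        label = "inherited" if divs[best] <= threshold else "mutated"
        links.append((k, best, label))
    return links


# Toy topic-word distributions over a four-word vocabulary.
prev_topics = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
curr_topics = [[0.6, 0.3, 0.1, 0.0], [0.05, 0.45, 0.4, 0.1]]

links = link_topics(prev_topics, curr_topics)
for curr_idx, prev_idx, label in links:
    print(f"current topic {curr_idx} -> previous topic {prev_idx}: {label}")
```

Because KL divergence is asymmetric, a symmetrized variant (for example, the average of both directions) would be an equally plausible choice here; the abstract does not say which form is used.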
Keywords/Search Tags: Internet Public Opinion, Text Mining, LDA, Probabilistic Topic Model, Parameter Estimation, Gibbs Sampling