Topic Detection And Tracking Based On Dirichlet Process Mixture Model

Posted on: 2014-10-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Wang
GTID: 1268330401963114
Subject: Computer Science and Technology
Abstract/Summary:
The Internet has become an important source of news. Grouping large volumes of news stories by their latent topics, and tracking the stories that belong to a specific topic, both reduces the time users need to find news of interest and provides an efficient topic-oriented organization of information. Two key problems must be solved to implement such an organization: how to automatically group an initial set of news stories according to the latent topics they discuss; and how to automatically associate incoming stories with topics that are known in advance, or cluster them into new topics. These two problems correspond to topic detection and topic tracking, respectively.

Although much progress has been made in research on topic detection and tracking, open problems remain. Examples include how to determine the number of topics precisely in the topic detection task, and how to deal with severe data sparseness, topic excursion, and topic deviation in the topic tracking task.

To address these problems, this thesis investigates a Bayesian non-parametric approach, the Dirichlet Process Mixture Model (DPMM). First, DPMM is applied to topic detection and topic tracking separately. Then DPMM is refined to solve the two tasks simultaneously, and is verified to be effective under various data settings. Finally, by integrating topic detection and tracking, a system scheme is designed to reduce the time users need to find news of interest and to meet the application requirements of topic-oriented Internet information organization. The main research work and achievements are as follows:

(1) To avoid the subjectivity of choosing the number of topics when prior knowledge about the topics is lacking, a topic detection model based on DPMM is proposed. The model does not fix the number of topics in advance, but determines it automatically while processing the news stories. DPMM assumes that every story corresponds to a distribution over topics, and assigns to the story the topic with the maximum probability. The experimental results indicate that the DPMM-based topic detection model performs better than several existing methods. The lowest detection error cost is 0.0981, a decrease of more than 50% compared with traditional cluster-based topic detection models.

(2) To relax the word independence assumption in DPMM, contextual information is introduced into Gibbs sampling during parameter inference. The improved sampling method takes contextual words into account when computing the sampling probability of a word, which reflects the real word correlations in natural language. The experimental results show that the improved parameter inference method yields better topic detection performance.

(3) To alleviate the influence of the lack of on-topic stories in the static topic tracking task, prior knowledge of the known topics is exploited in the Gibbs sampling procedure. The tracking results are then obtained by voting over the Gibbs sampling results. The experiments indicate that the prior knowledge improves topic tracking performance significantly even when only a few on-topic stories are available. The lowest tracking error cost is 0.0723, a decrease of 45% compared with the topic tracking method based on the unigram model. Moreover, the voting method ensures stable performance.

(4) To overcome the topic excursion and topic deviation introduced by existing adaptive learning mechanisms, the thesis presents a new adaptive tracking method based on DPMM. Its basic idea is to attach a metric, M_reli, to tracking feedback in order to control the errors introduced by feedback from off-topic stories. The experimental results show that the adaptive DPMM model, without requiring large-scale in-domain data, significantly mitigates topic excursion in the topic tracking task and the topic deviation introduced by existing adaptive learning mechanisms. The lowest tracking error cost is 0.0677, a decrease of 6% compared with the static topic tracking model.

(5) Based on the above topic detection and topic tracking techniques, a topic detection and tracking system is designed to meet practical application requirements. The system scheme first organizes news story streams into story clusters, with each cluster corresponding to a topic, and extracts tags describing each topic from the news story streams. Finally, the story clusters and topic tags are presented to users. The system scheme achieves the goals of reducing the time users need to find news of interest and of organizing Internet news stories according to their latent topics.
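
As an illustration of points (1) and (3) above, the following is a minimal sketch of how a DPMM can infer the number of topics rather than fix it, and how a vote over Gibbs samples might be taken. It is not the thesis's implementation. It assumes a collapsed Gibbs sampler over a Dirichlet-multinomial mixture of bag-of-words stories; the function names, the hyperparameters alpha and beta, the co-assignment vote, and the toy data are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def log_predictive(bag, counts, total, vocab_size, beta):
    """Log-probability of a bag of words given the words already in a cluster
    (Dirichlet-multinomial predictive, symmetric Dirichlet(beta) prior)."""
    length = sum(bag.values())
    lp = math.lgamma(total + vocab_size * beta)
    lp -= math.lgamma(total + length + vocab_size * beta)
    for word, c in bag.items():
        n = counts.get(word, 0)
        lp += math.lgamma(n + c + beta) - math.lgamma(n + beta)
    return lp


def _new_cluster():
    return {"size": 0, "counts": Counter(), "total": 0}


def _move(cluster, bag, sign):
    """Add (sign=+1) or remove (sign=-1) one story from a cluster."""
    cluster["size"] += sign
    cluster["total"] += sign * sum(bag.values())
    for word, c in bag.items():
        cluster["counts"][word] += sign * c


def dpmm_gibbs(docs, vocab_size, alpha=1.0, beta=0.5, sweeps=30, seed=0):
    """Collapsed Gibbs sampling; returns final labels and per-sweep labels."""
    rng = random.Random(seed)
    bags = [Counter(d) for d in docs]
    clusters = {0: _new_cluster()}  # assumption: start with one cluster
    labels = [0] * len(bags)
    for bag in bags:
        _move(clusters[0], bag, +1)
    next_id, history = 1, []
    for _ in range(sweeps):
        for i, bag in enumerate(bags):
            # Take story i out of its cluster; drop the cluster if now empty.
            old = clusters[labels[i]]
            _move(old, bag, -1)
            if old["size"] == 0:
                del clusters[labels[i]]
            # Existing clusters are weighted by size times predictive likelihood.
            ids = list(clusters)
            logw = [math.log(clusters[k]["size"])
                    + log_predictive(bag, clusters[k]["counts"],
                                     clusters[k]["total"], vocab_size, beta)
                    for k in ids]
            # A brand-new cluster is always a candidate, weighted by alpha.
            ids.append(next_id)
            logw.append(math.log(alpha)
                        + log_predictive(bag, {}, 0, vocab_size, beta))
            top = max(logw)
            weights = [math.exp(x - top) for x in logw]
            chosen = rng.choices(ids, weights=weights)[0]
            if chosen == next_id:
                clusters[chosen] = _new_cluster()
                next_id += 1
            _move(clusters[chosen], bag, +1)
            labels[i] = chosen
        history.append(list(labels))
    return labels, history


def same_topic_vote(history, seed_story, story):
    """Fraction of sweeps in which `story` shares a cluster with `seed_story`."""
    hits = sum(1 for lab in history if lab[story] == lab[seed_story])
    return hits / len(history)


# Toy stories as lists of word ids (hypothetical data, vocabulary of 10 words).
docs = [[0, 1, 2, 1], [1, 2, 3, 0], [0, 2, 2, 4],
        [5, 6, 7, 6], [6, 7, 8, 9], [5, 9, 9, 7]]
labels, history = dpmm_gibbs(docs, vocab_size=10)
print("inferred number of topics:", len(set(labels)))
print("vote, story 1 vs story 0:", same_topic_vote(history, 0, 1))
```

Because a new cluster is a candidate at every step, the number of topics is an outcome of sampling rather than an input. The vote is taken over co-assignments instead of raw cluster ids so that it does not depend on how clusters happen to be numbered; in practice the early sweeps would usually be discarded before voting.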
Keywords/Search Tags: topic detection and tracking, topic detection, topic tracking, Dirichlet process mixture model, Gibbs sampling, prior knowledge of known topic