News Topic Detection Research Based On Semi-supervised DPMM

Posted on: 2018-10-21   Degree: Master   Type: Thesis
Country: China   Candidate: D D Yao   Full Text: PDF
GTID: 2348330539985368   Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of China's information industry places higher demands on Internet technology innovation, and the explosive growth of information data brings further challenges. Topic detection and tracking (TDT) has become a significant part of online public opinion mining. Topic detection organizes scattered texts into topics and is an effective means of public opinion analysis. Because topic detection is inherently unsupervised, cluster analysis is the preferred technique. However, traditional clustering algorithms all have shortcomings to some degree, and basic problems remain, such as how to determine the number of clusters K. Starting from the semantic relations among texts, this thesis studies the Dirichlet Process Mixture Model (DPMM), a basic model among nonparametric Bayesian models. The main work of this thesis is as follows:

1. Based on a study and analysis of the general DPMM, this thesis discusses the feasibility of applying DPMM to topic detection and analysis, derives the sampling formula for topic analysis, and optimizes the cluster number K using hierarchy theory.

2. A concrete procedure for a semi-supervised DPMM is proposed. A small number of hot keywords are introduced as prior knowledge to guide the otherwise independent clustering process of DPMM, based on the mutually exclusive relationship among word groups. An effective method for selecting hot keywords is also given, based on an analysis of the power-law distribution of word frequency and the positioning function of noun entities.

3. The semi-supervised DPMM is extended by analyzing the generalization ability of the semi-supervised method and applying it to the LDA model. For the topic fusion problem caused by data imbalance, the OPTICS density method is used to obtain and analyze the reachability graphs of the resulting clusters.

To measure the performance of the experiments in this thesis, both the standard TDT4 corpus and a news corpus collected from the Internet are used. Experimental results show that the proposed semi-supervised DPMM can both determine the number of clusters automatically and improve detection performance significantly. The semi-supervised method also shows good adaptability and performance in the LDA model, and the OPTICS analysis method is effective at weakening the fusion of topics in the results.
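The abstract does not reproduce the thesis's sampling formula. As a rough illustration of how a DPMM can settle on a number of clusters while it clusters, the following is a minimal sketch of a collapsed Gibbs sampler for a Dirichlet process mixture of multinomials under a Chinese restaurant process prior. The function names, the hyperparameters `alpha` and `beta`, and the bag-of-words document representation are assumptions made for illustration and are not taken from the thesis. The semi-supervised keyword guidance and the hierarchical optimization of K are not included.

```python
import math
import random
from collections import Counter


def log_predictive(doc, word_counts, total, vocab_size, beta):
    """Log posterior-predictive of a document under one cluster.

    Dirichlet-multinomial with symmetric prior beta, evaluated word by word.
    The multinomial coefficient is omitted because it is the same for
    every cluster and cancels when the weights are normalized.
    """
    logp = 0.0
    seen = Counter()
    for i, w in enumerate(doc):
        logp += math.log(word_counts.get(w, 0) + seen[w] + beta)
        logp -= math.log(total + i + vocab_size * beta)
        seen[w] += 1
    return logp


def gibbs_dpmm(docs, vocab_size, alpha=1.0, beta=0.1, iters=50, seed=0):
    """Hypothetical collapsed Gibbs sampler; returns one cluster id per doc."""
    rng = random.Random(seed)
    z = [None] * len(docs)
    clusters = {}  # cluster id -> {"n_docs", "words", "total"}
    next_id = 0
    for _ in range(iters):
        for d, doc in enumerate(docs):
            # Remove the document from its current cluster, if it has one.
            if z[d] is not None:
                c = clusters[z[d]]
                c["n_docs"] -= 1
                c["words"].subtract(doc)
                c["total"] -= len(doc)
                if c["n_docs"] == 0:
                    del clusters[z[d]]
            # Existing clusters: weight proportional to size times likelihood.
            ids = list(clusters)
            logw = [
                math.log(clusters[k]["n_docs"])
                + log_predictive(doc, clusters[k]["words"],
                                 clusters[k]["total"], vocab_size, beta)
                for k in ids
            ]
            # A new cluster: weight proportional to alpha times prior predictive.
            logw.append(math.log(alpha)
                        + log_predictive(doc, {}, 0, vocab_size, beta))
            # Normalize in log space for numerical stability, then sample.
            m = max(logw)
            weights = [math.exp(x - m) for x in logw]
            choice = rng.choices(range(len(weights)), weights=weights)[0]
            if choice == len(ids):
                k = next_id
                next_id += 1
                clusters[k] = {"n_docs": 0, "words": Counter(), "total": 0}
            else:
                k = ids[choice]
            c = clusters[k]
            c["n_docs"] += 1
            c["words"].update(doc)
            c["total"] += len(doc)
            z[d] = k
    return z
```

In this sketch each document is a list of word tokens, and `len(set(z))` after sampling gives the number of clusters the sampler ended with. A larger `alpha` makes opening a new cluster more likely, which is the mechanism that lets the cluster count vary instead of being fixed in advance.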
Keywords/Search Tags:Topic Detection, Dirichlet Process, Gibbs Sampling, Semi-Supervised Method, Hot Key Words, Density Analysis