
Topic Model Algorithms Based on Features, Priors and Constraints

Posted on: 2015-02-23   Degree: Master   Type: Thesis
Country: China   Candidate: X N Wu   Full Text: PDF
GTID: 2268330428998567   Subject: Computer application technology
Abstract/Summary:
As one of the most popular probabilistic topic models, Latent Dirichlet allocation (LDA) has been recognized as a useful tool for analyzing documents. It extracts semantic topics from word co-occurrence at the document level, transforms documents from the word space into the topic space, and thereby obtains a low-dimensional representation of documents. However, LDA users usually face two problems. First, common and stop words tend to occupy all topics, leading to poor topic interpretability. Second, there is little guidance on how to improve the low-dimensional topic features for better retrieval, clustering or classification performance. To solve these problems, we re-examine LDA from three perspectives: continuous features, asymmetric Dirichlet priors and sparseness constraints.

LDA uses discrete word frequencies as input and assumes that words in the corpus are characterized only by their frequency. Continuous features treat words differently according to where they occur: they assign a higher value to words that appear frequently in a few documents but rarely across the corpus, and a lower value to words that appear frequently throughout the corpus. Using continuous features as the input of LDA decreases the feature values of common and stop words, which lowers their likelihood in the topic-word distributions. However, common words are partly important for the inference and parameter estimation of topic models, so the improvement from continuous features alone is not substantial.

The priors of LDA are usually set to fixed symmetric values. However, it is more accurate to estimate the priors from the topic information at every iteration. Symmetric priors allocate common and stop words to every topic with equal probability, whereas asymmetric priors increase the probability that stop words are allocated to topics with high priors, so that they occur in only a few topics. During training, learning the priors also improves the posterior of the model and yields a better representation of the low-dimensional topic-based features.

Usually, sparse information conveys meaning more clearly. Common and stop words occur in many topics and thus have low sparseness, while keywords mostly have high sparseness. We add sparseness constraints to the inference and parameter estimation process to encourage words with sparse topic distributions and penalize words with dense ones. In this way, we alleviate the problem of common and stop words and improve the low-dimensional topic features.

In this thesis, we study three factors of LDA, namely continuous features, asymmetric Dirichlet priors and sparseness constraints, and build a factor graph that contains these factors. We propose several novel belief propagation (BP)-based algorithms to study the three perspectives and evaluate them with various criteria. Experimental results show that continuous features improve the interpretability of the topic-word distributions by effectively removing almost all stop and common words, that asymmetric priors improve topic interpretation as well as document classification and clustering performance, and that sparseness constraints improve the overall performance.
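The abstract does not give the exact weighting formula behind the continuous features; a TF-IDF-style score is one common way to realize the behavior it describes (large values for words frequent inside a document but rare in the corpus, small values for corpus-wide common and stop words). The minimal sketch below, with a hypothetical `tfidf` helper and toy corpus, is only an illustration of that weighting, not the thesis's implementation.

```python
# Illustrative only: a TF-IDF-style continuous feature.
# Words frequent in one document but rare in the corpus get high values;
# corpus-wide common/stop words (e.g. "the") get values near zero.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices rose as the market rallied".split(),
]

def tfidf(docs):
    n_docs = len(docs)
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            w: (tf[w] / len(doc)) * math.log(n_docs / df[w])
            for w in tf
        })
    return weights

for i, w in enumerate(tfidf(docs)):
    # top-3 weighted words per document; "the" scores 0 in every document
    print(i, sorted(w.items(), key=lambda kv: -kv[1])[:3])
```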
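The abstract likewise does not state how the asymmetric prior is re-estimated at each iteration. One standard choice for learning an asymmetric Dirichlet prior from the current document-topic counts is Minka's fixed-point update; the sketch below assumes that rule and a hypothetical `update_alpha` helper purely for illustration.

```python
# Illustrative only: Minka-style fixed-point update of an asymmetric
# Dirichlet prior alpha from document-topic counts n[d][k], run once
# per training iteration.
import numpy as np
from scipy.special import digamma

def update_alpha(alpha, n_dk, iters=5):
    """alpha: (K,) current prior; n_dk: (D, K) topic counts per document."""
    n_d = n_dk.sum(axis=1)                       # tokens per document
    for _ in range(iters):
        a0 = alpha.sum()
        num = (digamma(n_dk + alpha) - digamma(alpha)).sum(axis=0)
        den = (digamma(n_d + a0) - digamma(a0)).sum()
        alpha = alpha * num / den                # element-wise fixed point
    return alpha

# Toy example: 3 documents, 4 topics; the learned prior becomes asymmetric,
# skewing toward the heavily used topic 0.
counts = np.array([[30, 5, 3, 2], [25, 8, 4, 3], [28, 6, 2, 4]], float)
print(update_alpha(np.ones(4), counts))
```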
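Finally, the abstract does not define its sparseness measure. One simple proxy, assumed here only to make the idea concrete, is one minus the normalized entropy of a word's topic distribution: near 0 for words spread evenly over topics (common and stop words) and near 1 for words concentrated in a single topic (keywords).

```python
# Illustrative only: sparseness of a word's topic distribution as
# 1 - normalized entropy over K topics.
import math

def topic_sparseness(p):
    """p: a word's distribution over K topics (non-negative, sums to 1)."""
    k = len(p)
    entropy = -sum(q * math.log(q) for q in p if q > 0.0)
    return 1.0 - entropy / math.log(k)

stop_word = [1.0 / 10] * 10            # spread evenly over 10 topics
keyword   = [0.91] + [0.01] * 9        # concentrated in one topic

print(topic_sparseness(stop_word))     # ~0.0, low sparseness
print(topic_sparseness(keyword))       # ~0.78, high sparseness
```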
Keywords/Search Tags: Latent Dirichlet allocation, belief propagation, continuous features, asymmetric Dirichlet priors, sparseness constraints