
Asymmetric-Prior Author Topic Models

Posted on: 2012-02-27
Degree: Master
Type: Thesis
Country: China
Candidate: W Xue
Full Text: PDF
GTID: 2178330332476014
Subject: Computer Science and Technology
Abstract/Summary:
This thesis proposes a novel model, named Asymmetric-prior Author Latent Dirichlet Allocation (AALDA), which simultaneously analyses the semantic structure of document collections and the research interests of their authors.

The explosive growth of web text is challenging the information retrieval and machine learning communities. How to analyse the data in digital libraries, how to model this information, and how to make predictions about new documents are persistent questions in academic research. Generative models are among the most popular and effective tools for analysing large-scale text data: not only can they extract an interpretable semantic structure from text, but they can also predict the properties of new documents. Latent Dirichlet Allocation (LDA) is one of the most popular topic models. Using a hierarchical Bayesian structure, it treats each document as a random mixture over multiple latent topics. To simplify the original likelihood-maximization problem, hidden random variables are introduced. With the Expectation-Maximization (EM) method, the posterior distribution of the latent variables is estimated in the Expectation step, while the likelihood of the observed and hidden variables is maximized in the Maximization step. In this way, important information about the observed data can be recovered.

This thesis details a Gibbs sampling procedure for approximating the posterior distribution of the hidden random variables. It then introduces a new topic model in which the symmetric priors are replaced by asymmetric priors. This technique places Polya distributions over the latent topic assignments z, which can be described by the Polya urn scheme. The resulting rich-get-richer effect helps the model automatically refine the number of latent topics. The new model treats the hyperparameters as free rather than fixed values and estimates them by maximizing the likelihood.
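The collapsed Gibbs update for LDA mentioned above can be sketched as follows. This is an illustrative sketch, not the thesis's implementation: the function name `gibbs_lda` and its interface are our own, and an asymmetric prior is expressed simply by passing a non-uniform `alpha` vector.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha, beta, iters=100, seed=0):
    """Collapsed Gibbs sampler for LDA.

    docs  : list of documents, each a list of word ids in [0, V)
    V, K  : vocabulary size and number of topics
    alpha : per-topic document prior, shape (K,); asymmetric if entries differ
    beta  : symmetric scalar prior on topic-word distributions
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    ndk = np.zeros((len(docs), K))   # document-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # total tokens per topic
    z = []                           # topic assignment for every token
    # random initialization of topic assignments
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove the token's current assignment from the counts
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional p(z_i = k | z_-i, w) up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw
```

The returned count matrices, smoothed by the priors, give the usual point estimates of the document-topic and topic-word distributions.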
To accelerate the optimization of the Polya distribution parameters, we use Minka's fixed-point method and a modified version of it, which avoids nested Gibbs sampling iterations. We then propose the novel AALDA model, which simultaneously extracts the topics of documents and the research interests of authors: each author corresponds to an asymmetric prior that reflects his or her research area. Three groups of experiments on the NIPS paper corpus show that the new model successfully recovers the hot research topics of the NIPS conference in the closing years of the last century, and also reveals the distinct research biases of some well-known scholars. In addition, an improvement in perplexity over the original LDA model demonstrates the superiority of the new model.
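The fixed-point update referred to above iterates α_k ← α_k · Σ_d [ψ(n_dk + α_k) − ψ(α_k)] / Σ_d [ψ(n_d + Σ_j α_j) − ψ(Σ_j α_j)], where ψ is the digamma function. The sketch below is our own illustration of that update applied to per-document topic counts, not the thesis's modified variant; the function name `minka_update` is invented for this example, and the digamma function is implemented inline to keep the sketch self-contained.

```python
import math
import numpy as np

def digamma(x):
    # Recurrence psi(x) = psi(x+1) - 1/x to shift x above 6,
    # then the standard asymptotic expansion.
    r = 0.0
    while x < 6:
        r -= 1.0 / x
        x += 1.0
    x2 = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - x2 * (1/12 - x2 * (1/120 - x2 / 252))

def minka_update(counts, alpha, iters=100):
    """Fixed-point estimation of Dirichlet/Polya parameters.

    counts : (D, K) array of per-document topic counts n_dk
    alpha  : initial (K,) parameter vector, all entries > 0
    """
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float).copy()
    n_d = counts.sum(axis=1)            # total tokens per document
    dg = np.vectorize(digamma)
    for _ in range(iters):
        a0 = alpha.sum()
        # numerator: per-topic sum over documents
        num = (dg(counts + alpha) - dg(alpha)).sum(axis=0)
        # denominator: shared across topics
        den = (dg(n_d + a0) - dg(a0)).sum()
        alpha *= num / den
    return alpha
```

Because the update is multiplicative, positive parameters stay positive, and no sampling is needed inside the optimization loop, which is what makes it attractive as an alternative to nested Gibbs iterations.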
Keywords/Search Tags: Machine Learning, Topic Models, Probabilistic Graphical Models