
Research On Modeling Large Scale Datasets Based On Topic Models

Posted on: 2016-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Lu
Full Text: PDF
GTID: 2180330467497275
Subject: Computer software and theory
Abstract:
Topic models are statistical methods for uncovering the hidden structure of text collections. Today, topic modeling algorithms are among the most popular tools for analyzing data such as text, images, and videos. In the era of data explosion, however, they face significant challenges from big data streams, and modeling large-scale collections has become an important direction in machine learning. A representative attempt to address this need is the stochastic variational inference (SVI) algorithm proposed by Hoffman et al., who applied SVI to latent Dirichlet allocation (LDA), widely acknowledged as the fundamental topic model. SVI for LDA (online LDA) has been shown to handle big data streams successfully: at each iteration, online LDA updates the global parameters of interest with stochastic natural gradients computed from a random subset (i.e., mini-batch) of the corpus. However, the complexity of text data limits its performance. Two major problems arise: first, only a few unique words appear in each mini-batch, which makes the stochastic gradients noisy; second, unique words occur with different frequencies, so the corresponding parameters converge at different speeds.

To address the first problem, we propose momentum online LDA (MOLDA). MOLDA incorporates a momentum term, a weighted sum of the previous stochastic gradients, into the update rule of the global variational parameters. The momentum term is cheap to compute, so MOLDA efficiently reuses past information to smooth out the noise of the stochastic gradients.

For the second problem, we propose a per-parameter adaptive learning rate (PPAR) method for online LDA. PPAR uses second-order information to control the decay of the learning rates, which are tuned adaptively based on the sampled data and the current parameters. This helps online LDA follow a better trajectory and converge to a better optimum.

To evaluate these methods, we collect two very large text collections, each containing millions of documents. For MOLDA, we use online LDA as the baseline; for PPAR, we compare against three state-of-the-art learning-rate methods. Empirical results show that MOLDA converges faster and yields a better predictive distribution, and that PPAR outperforms the other three learning-rate methods.
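To make the update rules concrete, the sketch below illustrates in Python/NumPy the three updates discussed above: the plain online LDA step, a momentum-smoothed step in the spirit of MOLDA, and an AdaGrad-style per-parameter adaptive step in the spirit of PPAR. This is a minimal illustration of the general techniques, not the thesis implementation; all names (svi_step, momentum_step, adaptive_step, tau0, kappa, mu) are assumptions chosen for clarity.

    import numpy as np

    def svi_step(lam, lam_hat, t, tau0=1.0, kappa=0.7):
        """Plain online LDA: blend the current global variational
        parameters lam with the mini-batch estimate lam_hat using a
        decaying Robbins-Monro step size rho_t."""
        rho = (t + tau0) ** (-kappa)
        return (1.0 - rho) * lam + rho * lam_hat

    def momentum_step(lam, noisy_grad, velocity, rho, mu=0.9):
        """Momentum-smoothed step (MOLDA-style sketch): accumulate a
        weighted sum of past stochastic gradients to damp mini-batch
        noise. For online LDA, noisy_grad can be the stochastic natural
        gradient (lam_hat - lam)."""
        velocity = mu * velocity + noisy_grad  # weighted sum of past gradients
        return lam + rho * velocity, velocity

    def adaptive_step(lam, noisy_grad, grad_sq, eps=1e-8):
        """Per-parameter adaptive rate (PPAR-style sketch): scale each
        coordinate by its accumulated squared gradients, so parameters
        of frequent words receive smaller effective learning rates."""
        grad_sq = grad_sq + noisy_grad ** 2    # second-order (curvature-like) statistic
        return lam + noisy_grad / (np.sqrt(grad_sq) + eps), grad_sq

Note how the adaptive step addresses the second problem above: parameters tied to frequently appearing words accumulate squared gradients faster and are therefore updated with smaller effective learning rates, while rarely observed parameters keep larger steps.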
Keywords: Topic Models, Latent Dirichlet Allocation, Bayesian Variational Inference, Stochastic Variational Inference, Stochastic Gradient Method with Momentum, Adaptive Learning Rate, Online Learning