
Research On Modeling Large Scale Datasets Based On Topic Models

Posted on: 2016-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Lu
Full Text: PDF
GTID: 2180330467497275
Subject: Computer software and theory
Abstract:
Topic models are statistical methods for uncovering the hidden structure of text collections. Today, topic modeling algorithms are among the most popular tools for analyzing data such as text, images, and videos. In the era of data explosion, however, they face significant challenges from big data streams, and modeling large-scale collections has become an important direction in machine learning. A representative attempt to address this need is the stochastic variational inference (SVI) algorithm proposed by Hoffman et al., who applied SVI to latent Dirichlet allocation (LDA), widely acknowledged as the fundamental topic model. SVI for LDA (online LDA) has been shown to handle big data streams successfully: at each iteration, online LDA updates the global parameters of interest with stochastic natural gradients computed from a random subset (i.e., mini-batch) of the corpus. However, the complexity of text data limits its performance. Two major problems arise: first, only a few unique words appear in each mini-batch, which makes the stochastic gradients noisy; second, unique words occur with different frequencies, so the corresponding parameters converge at different speeds.

To address the first problem, we propose momentum online LDA (MOLDA). MOLDA incorporates a momentum term, a weighted sum of the previous stochastic gradients, into the update rule of the global variational parameters. The momentum term is cheap to compute, so MOLDA efficiently reuses past information to smooth out the noise of the stochastic gradients.

For the second problem, we propose a per-parameter adaptive learning rate (PPAR) method for online LDA. PPAR uses second-order information to control the decay of the learning rates, which are tuned adaptively based on the sampled data and the current parameters. This helps online LDA follow a better trajectory and converge to a better optimum.

To evaluate these methods, we collect two very large text collections, each containing millions of documents. For MOLDA, we use online LDA as the baseline; for PPAR, we compare against three state-of-the-art learning-rate methods. Empirical results show that MOLDA converges faster and yields a better predictive distribution, and that PPAR outperforms the other three learning-rate methods.
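To make the update rules concrete, the sketch below illustrates in Python/NumPy the three updates discussed above: the plain online LDA step, a momentum-smoothed step in the spirit of MOLDA, and an AdaGrad-style per-parameter adaptive step in the spirit of PPAR. This is a minimal illustration of the general techniques, not the thesis implementation; all names (svi_step, momentum_step, adaptive_step, tau0, kappa, mu) are assumptions chosen for clarity.

    import numpy as np

    def svi_step(lam, lam_hat, t, tau0=1.0, kappa=0.7):
        """Plain online LDA: blend the current global variational
        parameters lam with the mini-batch estimate lam_hat using a
        decaying Robbins-Monro step size rho_t."""
        rho = (t + tau0) ** (-kappa)
        return (1.0 - rho) * lam + rho * lam_hat

    def momentum_step(lam, noisy_grad, velocity, rho, mu=0.9):
        """Momentum-smoothed step (MOLDA-style sketch): accumulate a
        weighted sum of past stochastic gradients to damp mini-batch
        noise. For online LDA, noisy_grad can be the stochastic natural
        gradient (lam_hat - lam)."""
        velocity = mu * velocity + noisy_grad  # weighted sum of past gradients
        return lam + rho * velocity, velocity

    def adaptive_step(lam, noisy_grad, grad_sq, eps=1e-8):
        """Per-parameter adaptive rate (PPAR-style sketch): scale each
        coordinate by its accumulated squared gradients, so parameters
        of frequent words receive smaller effective learning rates."""
        grad_sq = grad_sq + noisy_grad ** 2    # second-order (curvature-like) statistic
        return lam + noisy_grad / (np.sqrt(grad_sq) + eps), grad_sq

Note how the adaptive step addresses the second problem above: parameters tied to frequently appearing words accumulate squared gradients faster and are therefore updated with smaller effective learning rates, while rarely observed parameters keep larger steps.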
Keywords: Topic Models, Latent Dirichlet Allocation, Bayesian Variational Inference, Stochastic Variational Inference, Stochastic Gradient Method with Momentum, Adaptive Learning Rate, Online Learning