
Research On Fast Gibbs Sampling Topic Inference Algorithms For Topic Models

Posted on: 2019-09-09  Degree: Doctor  Type: Dissertation
Country: China  Candidate: X T Zhou  Full Text: PDF
GTID: 1368330548456775  Subject: Computer software and theory
Abstract/Summary:
With the increasing popularity of smartphones and the rapid development of the Internet, especially the mobile Internet, the amount of text data on the Internet is growing explosively. Meanwhile, governments, enterprises, and individuals have a growing demand for intelligent text mining methods, and a series of such methods has been proposed in the research community. Among them, the topic model is an unsupervised learning method that can effectively discover the latent semantic topics hidden in text data. Using topic models to discover these topics accurately and quickly lets us organize and manage large collections of text at a higher conceptual level. Improving “accuracy” and “time-efficiency” are therefore the two key issues in topic model research, and improving the “time-efficiency” of topic inference without sacrificing “accuracy” is an important research direction.

This thesis focuses on the “time-efficiency” of topic inference in topic models. Its main purpose is to propose fast Gibbs sampling topic inference algorithms that leave the “accuracy” of the results unchanged. Concretely: 1) for Latent Dirichlet Allocation (LDA), a representative and general topic model, the thesis proposes a fast Gibbs sampling algorithm suited to topic inference on long-text data sets; 2) for the Biterm Topic Model (BTM), which discovers latent semantic topics in short-text data sets, the thesis proposes two fast Gibbs sampling topic inference algorithms. In detail, the main work is as follows.

(1) To address the limited reuse of computation (the “recycle computation” problem) in the topic inference process of the Sparse LDA algorithm for the LDA model, we propose a fast Gibbs sampling topic inference algorithm (ESparse LDA) based on Sparse LDA. ESparse LDA is exactly as accurate as Sparse LDA and more time-efficient. Sparse LDA is itself an accurate and fast Gibbs sampling inference algorithm for LDA, but its time-efficiency cannot be improved further because the word types of two adjacent tokens are usually different, which limits how much computation can be recycled during inference. The main idea of ESparse LDA is to first rearrange the tokens within each document by word type, so that tokens of the same word type are adjacent, and then reuse more computation through a caching strategy, thereby improving time-efficiency. ESparse LDA accomplishes the same task as Sparse LDA while making no approximation, so the exactness of the results is preserved. We give detailed theoretical explanations and comparative experimental analyses of the correctness of the algorithm idea, the exactness of the results, and the time-efficiency of the algorithm. In theory, the time complexity of ESparse LDA is lower than that of Sparse LDA; the experimental results show that ESparse LDA outperforms Sparse LDA by 31.85% on the data sets used in the experiments. From a practical point of view, ESparse LDA is best suited to long-text data sets in which documents contain relatively few distinct word types but relatively many tokens. It is also worth noting that the core idea of ESparse LDA
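To make the rearrange-and-cache idea concrete, the following is a minimal sketch of one Gibbs sweep in the style of ESparse LDA, not the authors' implementation. It assumes the standard collapsed LDA conditional p(z = k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ), visits the tokens of each document so that equal word types are adjacent, and caches the word-dependent factor across a run of identical word types. The Sparse LDA bucket decomposition itself is omitted for brevity, and all names and hyperparameter values are illustrative.

```python
import random

ALPHA, BETA = 0.1, 0.01  # symmetric Dirichlet hyperparameters (assumed values)

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, K, V):
    """One sweep over all tokens. docs[d] is a list of word ids; z[d][i] is
    the current topic of token i in document d; n_dk, n_kw, n_k are the
    usual document-topic, topic-word, and topic count tables."""
    for d, doc in enumerate(docs):
        # ESparse LDA step 1: visit positions so equal word types are adjacent.
        order = sorted(range(len(doc)), key=lambda i: doc[i])
        prev_w, f = None, None
        for i in order:
            w, k_old = doc[i], z[d][i]
            # Remove the token's current assignment from all count tables.
            n_dk[d][k_old] -= 1
            n_kw[k_old][w] -= 1
            n_k[k_old] -= 1
            if w != prev_w:
                # New word type: rebuild the cached word factor once per run.
                f = [(n_kw[k][w] + BETA) / (n_k[k] + V * BETA) for k in range(K)]
                prev_w = w
            else:
                # Same word type as the previous token: only topic k_old's
                # counts changed, so refresh a single cache entry.
                f[k_old] = (n_kw[k_old][w] + BETA) / (n_k[k_old] + V * BETA)
            # Unnormalized conditional p(z = k) ∝ (n_dk + α) * f[k].
            p = [(n_dk[d][k] + ALPHA) * f[k] for k in range(K)]
            r, k_new = random.random() * sum(p), K - 1
            for k in range(K):
                r -= p[k]
                if r <= 0.0:
                    k_new = k
                    break
            z[d][i] = k_new
            n_dk[d][k_new] += 1
            n_kw[k_new][w] += 1
            n_k[k_new] += 1
            # Keep the cache consistent for the next token of this run.
            f[k_new] = (n_kw[k_new][w] + BETA) / (n_k[k_new] + V * BETA)
```

This sketch still spends O(K) per token on the final draw; in the actual algorithm the cached factor feeds the Sparse LDA buckets, so the saving from reusing it grows with the length of each same-word run.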
is general, and it can also be used to derive fast Gibbs sampling topic inference algorithms for other topic models.

(2) To address the “high time complexity” and “long convergence time” problems in the topic inference process of the BTM model, we propose a fast Gibbs sampling algorithm (Sparse BTM) for inferring topics in BTM. BTM is an effective topic model for discovering latent semantic topics in short-text data sets, but its standard Gibbs sampling inference method (Std BTM) suffers from both problems. The main idea of Sparse BTM, which is built on Std BTM, is to reduce the unnecessary computation in Std BTM by recycling intermediate results and exploiting the sparsity of the topic-word count matrix N_TW in BTM, thereby lowering both the time complexity of inference and the convergence time of the model (a sketch of this sparse decomposition follows item (3) below). Essentially, Sparse BTM trades space for time: it reduces part of the time cost by adding part of the space cost. Theoretically, the time complexity of Sparse BTM is lower than that of Std BTM; the experimental results show that Sparse BTM converges up to 18 times faster than Std BTM when the number of topics is large (K = 1000).

(3) To address the limited reuse of computation (the “recycle computation” problem) in the topic inference process of the Sparse BTM algorithm for the BTM model, we propose a fast Gibbs sampling topic inference algorithm (ESparse BTM) based on Sparse BTM. ESparse BTM is exactly as accurate as Sparse BTM and more time-efficient. Sparse BTM is itself an accurate and fast Gibbs sampling inference algorithm for BTM, but its time-efficiency cannot be improved further because the biterm types of two adjacent biterm tokens are usually different, which limits how much computation can be recycled during inference. The main idea of ESparse BTM is to first rearrange the biterm tokens within the whole biterm data set by biterm type, so that biterm tokens of the same biterm type are adjacent, and then reuse more computation through a caching strategy, thereby improving time-efficiency. ESparse BTM accomplishes the same task as Sparse BTM while making no approximation, so the exactness of the results is preserved. We give detailed theoretical explanations and comparative experimental analyses of the exactness of the results and the time-efficiency of the algorithm. From a theoretical point of view, the time complexity of ESparse BTM is lower than that of Sparse BTM. The comparative experiments show that the time-efficiency of ESparse BTM is higher than that of Sparse BTM, especially on biterm data sets where the ratio of the number of biterm types to the number of biterm tokens is small; specifically, the time-efficiency of ESparse BTM can be 39.5% higher than that of Sparse BTM on the data sets used in the comparative experiments.
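As a concrete illustration of the sparsity idea in (2), the following is a minimal sketch, again illustrative rather than the authors' implementation, of sampling one biterm's topic. It assumes a collapsed BTM conditional of the form p(z_b = k) ∝ (n_k + α)(n_{k,w1} + β)(n_{k,w2} + β) / ((n_k· + Vβ)(n_k· + Vβ + 1)) for a biterm b = (w1, w2); some presentations use (n_k· + Vβ)² in the denominator instead. Expanding the numerator product splits the mass into a dense β² smoothing bucket over all K topics plus buckets that are nonzero only where N_TW is nonzero. The inverted index topics_of, mapping each word to the set of topics with a nonzero count for it, is an assumed helper structure.

```python
import random

def sample_biterm_topic(w1, w2, n_k, n_kw, n_ksum, topics_of, K, V, ALPHA, BETA):
    """Sample a topic for the biterm (w1, w2), whose current assignment is
    assumed to have already been removed from the count tables, using the
    split (n_{k,w1} + β)(n_{k,w2} + β) = c1*c2 + β*c1 + β*c2 + β²."""
    def coef(k):
        # Per-topic coefficient shared by all four buckets.
        d = n_ksum[k] + V * BETA
        return (n_k[k] + ALPHA) / (d * (d + 1.0))
    # Dense smoothing bucket: the β² term, over all K topics. In a full
    # implementation its total would be cached and updated incrementally.
    smooth = [BETA * BETA * coef(k) for k in range(K)]
    mass = sum(smooth)
    # Sparse buckets: iterate only over topics with a nonzero count for
    # w1 or w2, which is where the sparsity of N_TW pays off.
    extra = {}
    for k in topics_of[w1] | topics_of[w2]:
        c1, c2 = n_kw[k][w1], n_kw[k][w2]
        v = (c1 * BETA + c2 * BETA + c1 * c2) * coef(k)
        extra[k] = v
        mass += v
    # Draw from the combined unnormalized distribution, sparse part first.
    r = random.random() * mass
    for k, v in extra.items():
        r -= v
        if r <= 0.0:
            return k
    for k in range(K):
        r -= smooth[k]
        if r <= 0.0:
            return k
    return K - 1
```

ESparse BTM then adds the same aggregate-and-cache step shown in the earlier LDA sketch: the corpus's biterm tokens are sorted by biterm type, so the bucket computations for a repeated biterm can be reused across consecutive draws instead of being rebuilt for every token.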
Keywords/Search Tags:Topic Model, Latent Dirichlet Allocation, Topic Inference, Gibbs Sampling, Biterm Topic Model