
Fast Belief Propagation-Based Parallel Topic Modeling Techniques

Posted on: 2013-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Gu
Full Text: PDF
GTID: 2248330371494132
Subject: Computer technology
Abstract/Summary:
Probabilistic topic modeling uses unsupervised learning algorithms to automatically organize, understand, and summarize the main themes in a collection of documents, and it has long been a hotspot in the machine learning community. Hofmann proposed probabilistic Latent Semantic Analysis (pLSA) for probabilistic topic modeling, and Blei and colleagues proposed a more complete probabilistic topic model, Latent Dirichlet Allocation (LDA), which extends pLSA within the Bayesian framework. LDA first builds a model of the corpus and then applies Variational Bayes (VB), Gibbs Sampling (GS), or Belief Propagation (BP) to infer the model parameters; from the inferred parameters it uncovers the latent topic structure of the data and extracts the important information in the documents.

How to perform topic modeling efficiently on large-scale corpora, within practical limits on computing time and memory, so that the latent topic structure can be uncovered, has long been a concern of researchers. In this thesis, we use efficient OpenMP- and MPI-based parallel algorithms to speed up topic modeling on massive data sets, and we develop parallel algorithms to infer the LDA parameters. We find that the parallel fast BP algorithm infers the model parameters with the highest accuracy and the fastest speed, and handles large-scale data effectively. By using cluster systems or multi-core servers to share the storage load and increase the rate of computation, the parallel algorithms not only remove the bottleneck of a stand-alone computer but also improve the reliability, availability, and scalability of the system. The thesis is organized as follows:

1. We introduce topic models and their inference algorithms, which form the theoretical foundation of the parallel algorithms, and show experimentally that the fast BP algorithm tends to outperform the other algorithms in both speed and accuracy.

2. We propose an MPI-based parallel fast BP algorithm for clusters with distributed memory. First, we divide the large-scale corpus into many smaller parts, which are assigned to the nodes of the cluster for processing. Then we use the parallel fast BP algorithm to infer the model parameters on each node. Finally, we reduce the partial results into the final result (see the MPI sketch below). Experimental results show that this algorithm greatly improves computational efficiency and scalability.

3. We develop an OpenMP-based parallel fast BP algorithm that uses the shared memory of a server. First, we partition the corpus into tasks and assign these mini-tasks to the processors of a multi-processor server. Then we synchronize and estimate the model parameters through the shared memory (see the OpenMP sketch below). We find that the OpenMP-based parallel fast BP algorithm greatly improves computational efficiency thanks to the shared-memory mechanism.
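To make the distributed scheme of item 2 concrete, the following is a minimal sketch, in C with MPI, of the divide/infer/reduce cycle described above. It assumes documents are partitioned across ranks and that the per-rank word-topic counts are merged with MPI_Allreduce after every sweep; the function local_bp_sweep and the use of Allreduce are illustrative placeholders rather than the thesis's actual implementation, and MPI_Init/MPI_Finalize and data loading are omitted.

    #include <mpi.h>

    /* Placeholder: one synchronous BP sweep over this rank's document shard,
       reading the global word-topic counts and accumulating this rank's
       contribution into nw_local.  Not part of the thesis's code. */
    void local_bp_sweep(int rank, const double *nw_global,
                        double *nw_local, int W, int K);

    void parallel_fast_bp(double *nw_local,   /* W*K counts from local documents  */
                          double *nw_global,  /* W*K globally merged counts       */
                          int W, int K, int n_sweeps)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int s = 0; s < n_sweeps; s++) {
            /* each rank infers messages on its own part of the corpus */
            local_bp_sweep(rank, nw_global, nw_local, W, K);

            /* reduce: merge the per-rank word-topic counts so every rank
               starts the next sweep from the same global statistics */
            MPI_Allreduce(nw_local, nw_global, W * K,
                          MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        }
    }

The communication cost of this scheme is one Allreduce of a W-by-K matrix per sweep, independent of the number of documents, which is why splitting the corpus across nodes scales well as the collection grows.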
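For the shared-memory case of item 3, the following is a minimal sketch of one synchronous BP-style message update for LDA, parallelized over word-document tokens with an OpenMP parallel-for loop. The data layout, the dense count matrices, and the simplification of excluding only the current token's own contribution from the counts are assumptions made for illustration; the thesis's fast BP variant may organize the computation differently.

    #include <omp.h>

    typedef struct {           /* one (word, document) token group            */
        int w, d;              /* word id, document id                        */
        double x;              /* occurrences of word w in document d         */
        double *mu;            /* K-dimensional message mu_{w,d}(k)           */
    } Token;

    void bp_sweep(Token *tok, long n_tok, int K, int W,
                  double **nd,  /* nd[d][k]: document-topic counts            */
                  double **nw,  /* nw[w][k]: word-topic counts                */
                  double *nk,   /* nk[k]  : total counts per topic            */
                  double alpha, double beta)
    {
        #pragma omp parallel for schedule(dynamic, 256)
        for (long i = 0; i < n_tok; i++) {
            Token *t = &tok[i];
            double norm = 0.0;
            for (int k = 0; k < K; k++) {
                /* exclude this token's own contribution (a simplification
                   of the exact BP update, for illustration only) */
                double ndk = nd[t->d][k] - t->x * t->mu[k];
                double nwk = nw[t->w][k] - t->x * t->mu[k];
                double nkk = nk[k]       - t->x * t->mu[k];
                t->mu[k] = (ndk + alpha) * (nwk + beta) / (nkk + W * beta);
                norm += t->mu[k];
            }
            for (int k = 0; k < K; k++)
                t->mu[k] /= norm;   /* normalize the message over topics */
        }
        /* nd, nw and nk are rebuilt from the new messages between sweeps
           (omitted here), so threads only read shared counts inside the loop. */
    }

Because the counts are read but never written inside the parallel loop, and each iteration writes only its own message, the threads never touch shared state concurrently; the shared-memory synchronization therefore reduces to rebuilding the counts once per sweep.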
Keywords/Search Tags: Topic Modeling, LDA, GS, BP, OpenMP, MPI