
Research And Implementation Of Distributed Topic Clustering Technology For Text Flow

Posted on: 2018-03-31  Degree: Master  Type: Thesis
Country: China  Candidate: L Z Tang  Full Text: PDF
GTID: 2428330569999067  Subject: Computer technology
Abstract/Summary:
The topic model is a key technique for text topic clustering and is widely used in text analysis, news recommendation, public opinion monitoring, and related applications. Latent Dirichlet Allocation (LDA) is a successful topic model that solves text topic clustering given a fixed number of topics. However, determining the number of topics in a corpus in advance is difficult and often impractical, especially for a text stream whose topics evolve over time. The Hierarchical Dirichlet Process (HDP) applies nonparametric Bayesian modeling to topic modeling, and it effectively resolves the dilemma LDA faces in text stream clustering: topic evolution and an unknown number of topics. To apply topic models in a streaming computing setting and achieve large-scale text stream topic clustering, this thesis makes the following contributions.

(1) An online variational Bayesian inference method based on a parameter server. The existing online variational Bayesian inference method for LDA is not distributed and cannot handle large-scale text topic clustering, so this thesis introduces the parameter server model to manage the storage, distribution, and synchronization of model parameters. Based on the contribution ratio of each RDD training partition to the model parameter update, a partition-level parameter update strategy is proposed, yielding a distributed online variational Bayesian inference algorithm. On top of this parameter distribution and update scheme, a distributed architecture for LDA online variational Bayesian inference is designed and implemented on Spark. Experimental results show that, compared with Spark MLlib, the parameter-server-based method improves convergence performance, running speed, and the ability to handle large-scale problems.

(2) Distributed optimization and implementation of the Hierarchical Dirichlet Process. Compared with LDA, the HDP model has more latent parameters and its variational Bayesian inference is more complicated. Because the online variational Bayesian inference method for HDP is not distributed, it is difficult to apply to large-scale text stream topic clustering. This thesis analyzes the difficulties of running HDP online variational Bayesian inference in a distributed environment, designs a distributed scheme based on data parallelism and model parallelism, and implements a prototype distributed HDP system on Spark with a parameter server. Experimental results show that the distributed HDP system converges effectively and greatly improves training speed at the cost of a slight loss of convergence performance relative to the current single-machine algorithm. The training time for one window of the text stream is reduced to the level of a few minutes, bringing HDP-based text stream topic clustering to a practical stage.
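The Spark MLlib implementation mentioned in contribution (1) as the point of comparison is MLlib's online variational Bayes optimizer for LDA (Hoffman et al.'s algorithm). The following is a minimal PySpark sketch of that baseline on a toy corpus; the documents, column names, and hyperparameter values are illustrative assumptions, not taken from the thesis.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("online-lda-baseline").getOrCreate()

# Toy corpus standing in for one window of the text stream.
docs = spark.createDataFrame(
    [(0, "spark parameter server topic model"),
     (1, "online variational bayes inference for lda"),
     (2, "clustering topics in a text stream")],
    ["id", "text"])

words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
vectors = (CountVectorizer(inputCol="words", outputCol="features", vocabSize=10000)
           .fit(words).transform(words))

# Online variational Bayes optimizer; k (the number of topics) is fixed up front.
# subsamplingRate=1.0 only because the toy corpus is tiny; use a small
# mini-batch fraction on a real corpus.
lda = LDA(k=10, maxIter=20, optimizer="online",
          subsamplingRate=1.0, optimizeDocConcentration=True)
model = lda.fit(vectors)
model.describeTopics(maxTermsPerTopic=5).show(truncate=False)

spark.stop()
```

Note that k must be chosen before training, which is precisely the limitation that motivates the HDP work in contribution (2).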
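The abstract does not spell out the partition-level update strategy of contribution (1). The sketch below shows one plausible reading, assuming a parameter-server step in which each RDD partition computes local sufficient statistics, its proposed update is weighted by its contribution ratio (its share of the mini-batch), and the server blends the merged estimate into the global topic-word parameters with the usual online-VB learning rate. All function and variable names are hypothetical.

```python
import numpy as np

def merge_partition_updates(lam, partition_stats, batch_sizes, D, eta, t,
                            tau0=64.0, kappa=0.7):
    """One hypothetical parameter-server step of distributed online
    variational Bayes for LDA.

    lam             : K x V global topic-word variational parameters
    partition_stats : per-partition K x V sufficient statistics
                      (sum over the partition's documents of n_dw * phi_dwk,
                      produced by a local E-step)
    batch_sizes     : number of documents each partition processed
    D               : (estimated) total number of documents in the stream
    eta             : topic-word Dirichlet prior
    t               : index of the current mini-batch / stream window
    """
    rho = (tau0 + t) ** (-kappa)            # decaying learning rate
    total = float(sum(batch_sizes))
    lam_hat = np.zeros_like(lam)
    for stats, n_p in zip(partition_stats, batch_sizes):
        # Each partition proposes a full-corpus estimate of lambda; weight it
        # by its contribution ratio n_p / total before merging.
        lam_hat += (n_p / total) * (eta + (D / n_p) * stats)
    # Blend old and new parameters (stochastic natural-gradient step).
    return (1.0 - rho) * lam + rho * lam_hat

# Illustrative call: merge statistics from two partitions of a 3-document batch.
K, V = 4, 12
lam = np.random.gamma(100.0, 0.01, (K, V))
stats = [np.random.rand(K, V), np.random.rand(K, V)]
lam = merge_partition_updates(lam, stats, batch_sizes=[2, 1],
                              D=100_000, eta=0.01, t=1)
```

With these contribution-ratio weights, the merged estimate reduces to the standard single-machine online update lambda_hat = eta + (D / |B|) * (sum of per-document statistics), so the distributed step agrees with Hoffman et al.'s update while letting partitions compute their statistics independently.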
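The "current single-machine algorithm" that contribution (2) compares against is presumably online variational inference for the HDP (Wang, Paisley and Blei, 2011), in which stick-breaking truncations bound, but do not fix, the number of topics. As a reference point, here is a minimal sketch using gensim's HdpModel, which implements that algorithm, on a toy window of documents; the corpus and truncation levels are illustrative assumptions.

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel

# Toy window of tokenized documents from the text stream.
texts = [["spark", "parameter", "server", "topic", "model"],
         ["online", "variational", "bayes", "inference"],
         ["hdp", "stick", "breaking", "topic", "stream"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# T and K truncate the corpus-level and document-level stick-breaking
# processes, so the number of topics is bounded but not fixed in advance.
hdp = HdpModel(corpus=corpus, id2word=dictionary,
               chunksize=256, T=150, K=15, kappa=1.0, tau=64.0)
for topic in hdp.print_topics(num_topics=5, num_words=5):
    print(topic)
```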
Keywords/Search Tags: Topic Model, Text Topic Clustering, Latent Dirichlet Allocation (LDA), Hierarchical Dirichlet Process (HDP), Variational Bayesian Inference, Parameter Server