Research On Distributed Stochastic Variational Inference Algorithms For Big Data

Posted on: 2022-01-29    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Y Guo    Full Text: PDF
GTID: 1488306569984219    Subject: Computer software and theory
Abstract/Summary:
In recent years, stochastic variational inference (SVI) has demonstrated its power in a variety of machine learning tasks, with applications spanning fields such as natural language processing and information retrieval. Applications in these fields continuously collect data to be processed, ushering in the era of big data. The growth rate of data now far exceeds the growth rate of hardware capability, so distributed platforms have become the workhorse for training on big data. Unfortunately, most existing research on stochastic variational inference remains at the stage of solving applied-mathematics problems, whereas the design of distributed stochastic variational inference involves substantially more systems engineering, including the partitioning and aggregation of datasets and models, algorithmic complexity, and communication cost. When dealing with massive data, distributed stochastic variational inference raises the following new challenges:

1. As data volume and model dimensionality increase, a distributed computing environment is needed to accelerate the training of stochastic variational inference algorithms. At the same time, the growing popularity of the distributed platform Apache Spark has attracted many users to move their data into its ecosystem. However, both industry experience and existing research show that Spark is slow at running distributed machine learning algorithms, including stochastic variational inference. An existing workaround is to move the training task to a dedicated system that claims better performance, such as a parameter server, but the user must then go through the painful process of moving data in and out of Spark, which introduces new costs.

2. The communication of distributed stochastic variational inference algorithms is generally believed to be inefficient, yet theoretical research on this question is still missing. Moreover, the communication interval of a distributed stochastic variational inference algorithm has a strong impact on both training efficiency and inference quality, but a reasonable interval is difficult to estimate and select.

3. Although the use of stochastic variational inference on small, static datasets has been studied extensively, real-world datasets are usually very large and are collected as streams. Running machine learning algorithms on massive streaming data faces three challenges: model evolution, data turbulence, and real-time inference.

To address these challenges, the main research content and results of this dissertation are as follows:

1. This dissertation studies efficient stochastic variational inference algorithms for distributed systems. To improve the efficiency of processing massive data with stochastic variational inference, it takes latent Dirichlet allocation (LDA) as an example and investigates the performance bottlenecks of MLlib (the official Spark machine learning package) when running the distributed Online-LDA algorithm. The dissertation demonstrates that Spark's inferior performance is caused by implementation issues rather than fundamental flaws of the Bulk Synchronous Parallel (BSP) model that governs Spark's execution: Spark's performance can be improved significantly by applying the well-known "model averaging" (MA) technique to distributed latent Dirichlet allocation in Spark, yielding the MA-LDA algorithm. The implementation is non-intrusive and requires little development effort. Additional system optimization strategies further improve the convergence speed of MA-LDA, making it several orders of magnitude faster than the existing algorithm on Spark. Experimental results show that MA-LDA makes Spark's training speed comparable to the fastest latent Dirichlet allocation implementations on dedicated machine learning platforms, while the quality of the model obtained after convergence is significantly better.
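The following is a minimal, illustrative sketch of the model-averaging pattern described above, not the dissertation's actual implementation: each worker starts from the shared variational topic-word parameters, performs several local stochastic updates on its own data partition, and the driver then forms a weighted average of the per-worker parameters. The local update is only a placeholder; the function names, shapes, and weighting rule are assumptions.

```python
import numpy as np

K, V = 20, 5000  # number of topics and vocabulary size (illustrative)

def local_svi_step(lam, rho):
    """Placeholder for one worker-local stochastic variational update of the
    topic-word parameters `lam` (K x V). A real Online-LDA step would compute
    per-document variational posteriors on a mini-batch and a noisy
    natural-gradient estimate; a flat estimate stands in for it here."""
    grad_estimate = np.full_like(lam, 1.0 / V)
    return (1.0 - rho) * lam + rho * grad_estimate

def model_average(worker_lams, worker_doc_counts):
    """Model averaging: weighted average of per-worker variational parameters,
    weighted by the number of documents each worker processed."""
    w = np.asarray(worker_doc_counts, dtype=float)
    w /= w.sum()
    return sum(wi * lam for wi, lam in zip(w, worker_lams))

# One BSP-style communication round: every worker starts from the shared model,
# runs several local steps on its partition, then the driver averages.
shared_lam = np.random.gamma(100.0, 0.01, size=(K, V))
num_workers, local_steps, batch_size = 4, 10, 64
worker_lams, doc_counts = [], []
for _ in range(num_workers):
    lam = shared_lam.copy()
    for t in range(local_steps):
        lam = local_svi_step(lam, rho=1.0 / (t + 10))
    worker_lams.append(lam)
    doc_counts.append(local_steps * batch_size)
shared_lam = model_average(worker_lams, doc_counts)
```

In a Spark implementation, the inner loop would typically run inside a mapPartitions task and the averaging would be a single aggregation on the driver, which is why the approach can stay non-intrusive to Spark's BSP execution.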
2. This dissertation studies the theoretical convergence and communication efficiency of distributed stochastic variational inference algorithms. Experimental comparison shows that basic distributed stochastic variational inference (Online-LDA in Spark) has a higher communication cost than model-averaging stochastic variational inference (MA-LDA). However, how to theoretically analyze and compare the convergence rate and communication efficiency of these algorithms remains an open question. To address it, this dissertation proposes a formal analysis framework that can characterize the communication efficiency of various distributed stochastic variational inference algorithms. Based on this framework, the dissertation first derives that the basic distributed stochastic variational inference algorithm has linear communication complexity O(T), where T is the amount of data processed by each computing node, and then shows that the model-averaging stochastic variational inference algorithm has sub-linear communication complexity O(T^(3/4)). This not only fills the gap in theoretically analyzing and comparing different distributed stochastic variational inference algorithms, but also inspires the design of algorithms with better communication efficiency in theory.

3. This dissertation studies the communication interval of the distributed model-averaging stochastic variational inference algorithm. It discusses the advantages of the model-averaging algorithm from both the application and theoretical perspectives, and shows that setting a reasonable communication interval is necessary but difficult. To overcome the disadvantages of a fixed communication interval, this dissertation designs a novel algorithm with dynamic communication intervals, whose characteristic is that the communication interval decreases linearly as the model converges. The dissertation proves that this algorithm achieves the state-of-the-art convergence rate and communication complexity with theoretical guarantees, and that it avoids the drawbacks of model-averaging stochastic variational inference with a fixed interval. Experimental results on the latent Dirichlet allocation problem demonstrate its advantages.
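Below is a minimal sketch of a dynamically shrinking communication interval of the kind described in point 3, assuming the interval is measured in local SVI steps between synchronizations and shrinks linearly from an initial value to a minimum over the training rounds. The function name, schedule parameters, and the exact linear rule are illustrative assumptions, not the dissertation's precise schedule.

```python
def dynamic_interval(round_idx, total_rounds, initial_interval=64, min_interval=1):
    """Illustrative schedule: the number of local SVI steps between two
    synchronizations decreases linearly with the round index, so workers
    communicate more frequently as the model approaches convergence."""
    span = initial_interval - min_interval
    frac = round_idx / max(total_rounds - 1, 1)
    return max(min_interval, int(round(initial_interval - span * frac)))

# Example: the interval shrinks from 64 local steps to 1 over 100 rounds.
print([dynamic_interval(r, 100) for r in (0, 25, 50, 75, 99)])  # [64, 48, 32, 16, 1]
```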
4. This dissertation studies efficient training and real-time inference for distributed stochastic variational inference on streaming data. To cope with the three challenges posed by real-world streams (topic evolution, data turbulence, and real-time inference), this dissertation proposes a novel distributed stochastic variational inference algorithm for latent Dirichlet allocation on streaming data: StreamFed-LDA. The algorithm is implemented on a framework that supports lifelong learning and can capture evolving topics on streaming data. It also retains historical information while learning the features of the latest data, in order to handle data turbulence. In addition, it introduces techniques that reduce computation and communication costs, thereby increasing throughput and reducing latency, so that it can provide real-time inference on massive streaming data. The dissertation evaluates this algorithm on four real datasets; the experiments show that StreamFed-LDA significantly outperforms the baseline algorithms and reduces inference latency by several orders of magnitude.
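The following is a minimal sketch of the streaming pattern described above, under the assumption that each worker blends decayed historical topic-word parameters with statistics computed from the newest mini-batch (to handle data turbulence) and that workers are periodically aggregated by averaging. All names, the decay rule, and the aggregation scheme are illustrative assumptions and do not reproduce StreamFed-LDA itself.

```python
import numpy as np

K, V = 20, 5000  # topics and vocabulary size (illustrative)

def streaming_update(lam, batch_stats, decay=0.9):
    """Blend decayed historical parameters with sufficient statistics from the
    newest mini-batch, so established topics fade gradually instead of being
    overwritten when the stream's distribution shifts (data turbulence)."""
    return decay * lam + (1.0 - decay) * batch_stats

def aggregate(worker_lams):
    """Periodic aggregation across workers (a simple unweighted average here)."""
    return np.mean(worker_lams, axis=0)

# Each worker consumes its shard of the stream and contributes to a shared model;
# inference can be served at any time from the current `shared_lam` snapshot.
shared_lam = np.random.gamma(100.0, 0.01, size=(K, V))
for round_idx in range(5):                       # a few aggregation rounds
    worker_lams = []
    for w in range(4):                           # 4 workers
        lam = shared_lam.copy()
        for _ in range(10):                      # 10 mini-batches per round
            batch_stats = np.random.gamma(1.0, 1.0, size=(K, V))  # stand-in stats
            lam = streaming_update(lam, batch_stats)
        worker_lams.append(lam)
    shared_lam = aggregate(worker_lams)
```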
Keywords/Search Tags:variational inference, distributed machine learning, communication complexity, machine learning on streams, latent Dirichlet allocation