
Research On Key Technologies For Scalable Machine Learning

Posted on: 2019-11-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y W Ming
Full Text: PDF
GTID: 1368330611992973
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the rapid growth of data volumes has made research on key technologies for scalable machine learning urgent, while today's abundant computing resources provide the opportunity to realize it. To achieve scalable machine learning, this thesis follows two technical routes, efficient algorithm design and parallel and distributed methods, and investigates in depth how machine learning can cope effectively with the challenges of big data and large models. Through the co-design of algorithms and systems, and without sacrificing accuracy, it substantially improves the speed of machine learning and enhances its scalability in both computation and memory. The main research results are as follows:

1. Two data- and model-parallel extreme learning machine methods, LDMP-ELM and GDMP-ELM, are proposed; collectively they are called DMP-ELMs. LDMP-ELM runs faster and is easier to implement, while GDMP-ELM supports larger models, so the two complement each other. By resolving the computational and memory bottlenecks of existing methods, DMP-ELMs scale better: they can handle large-scale datasets and support large numbers of hidden-layer nodes. They do so by using data parallelism and model parallelism to raise the parallelism of extreme learning machines, chiefly through matrix partitioning and distributed matrix computation, and by fully exploiting the fact that ELM generates its hidden-layer parameters randomly. DMP-ELMs are developed entirely on the supercomputer software stack, such as MPI. In our experiments the algorithms scaled to 128 nodes and processed data and models too large to fit in single-node memory. To the best of our knowledge, this is the first successful training of a large extreme learning machine with 50,000 hidden-layer neurons on the mnist8m dataset, which has 8.1 million samples with 784-dimensional features. Most parallel and distributed machine learning research to date targets commodity computer clusters; little of it is based on supercomputing systems. This work studies parallel and distributed machine learning directly on supercomputer hardware and software.

2. A distributed, asynchronous stochastic gradient descent method based on variance reduction, DisSVRG, is proposed. DisSVRG combines the variance reduction technique with an asynchronous communication protocol: it updates the model parameters with variance-reduced gradients and shares the newly learned parameters among the cluster nodes asynchronously. In addition, we propose an adaptive learning rate with an acceleration factor to further accelerate DisSVRG, together with an adaptive sampling strategy that greatly reduces the waiting caused by stragglers during iteration. Separately, having found that the supercomputing software stack offers too low a level of abstraction for parallel and distributed machine learning research, we ported the parameter server framework, which originated on commodity computing clusters, to the supercomputing ecosystem. The parameter server uses MPI as its underlying communication layer, so the two computing frameworks merge naturally; in essence, the parameter server can be regarded as a layer of encapsulation over MPI that provides a higher-level, more concise programming model suited to machine learning. The analysis, use, and redevelopment of the parameter server greatly simplified the rest of this research.
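For concreteness, the sketch below shows a minimal single-process version of the SVRG-style update at the core of DisSVRG; the asynchronous parameter sharing, adaptive learning rate, and adaptive sampling described above are omitted. The function names and the least-squares example are illustrative assumptions, not code from the thesis.

```python
import numpy as np

def svrg(grad_i, w0, n, lr=0.01, epochs=10, seed=0):
    """Variance-reduced SGD (SVRG). grad_i(w, i) returns the gradient of
    the i-th loss term at w; their average over i is the full gradient."""
    rng = np.random.default_rng(seed)
    w = w0.astype(float)
    for _ in range(epochs):
        w_snap = w.copy()                       # snapshot of the iterate
        mu = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
        for _ in range(2 * n):                  # stochastic inner loop
            i = rng.integers(n)
            # unbiased gradient whose variance vanishes as w -> w_snap
            v = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - lr * v
    return w

# Toy usage: least-squares regression, one loss term per row of (X, y).
X = np.random.default_rng(1).standard_normal((200, 5))
y = X @ np.arange(1.0, 6.0)
w = svrg(lambda wt, i: (X[i] @ wt - y[i]) * X[i], np.zeros(5), n=200)
```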
3. Two distributed and scalable k-means clustering methods, Scalable Lloyd's K-Means and Scalable Mini-Batch K-Means, are proposed. Built on data parallelism and the parameter server system, both scale beyond the computational and memory limits of a single node. The former finds higher-quality solutions, while the latter converges to an acceptable solution more quickly; both scale well and compute entirely in memory. In addition, we propose a new aggregation method for Scalable Mini-Batch K-Means that makes the distributed clustering converge. Extensive experiments on four large-scale datasets show that the proposed algorithms converge well and achieve nearly linear speedup; for example, with 16 CPU cores distributed over 4 compute nodes, a speedup of about 14 times is achieved. This study applies the parallel and distributed stochastic gradient descent optimization of the previous result to the k-means clustering problem, further demonstrating the versatility of the parameter server in a supercomputing system.

4. Two k-means clustering methods based on variance reduction, VRKM and VRKM++, are proposed. Specifically, we first introduce a position-correction mechanism to counter the cluster-center drift that arises when the k-means objective is optimized with variance reduction, and we update the k-means parameters with a constant learning rate; on this basis we obtain the variance-reduced k-means method VRKM. We then improve VRKM theoretically to reduce its computational cost, yielding VRKM++. Compared with VRKM, VRKM++ avoids computing batch gradients and therefore requires less computation, making it more efficient. Both methods are implemented serially on a single node of the supercomputer system. Extensive experiments show that VRKM and VRKM++ outperform the state-of-the-art methods, achieving large-scale clustering speedups of about 2 times and 4 times, respectively. This work starts from efficient algorithm design and increases the scalability of k-means clustering without occupying more computing resources.
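As an illustration of the idea behind result 4, the sketch below applies an SVRG-style correction with a constant learning rate to the classical online k-means update. It is a simplification under explicit assumptions, not the thesis's VRKM: the position-correction mechanism for center drift is omitted, and each point keeps its snapshot assignment throughout the inner loop.

```python
import numpy as np

def vr_kmeans_sketch(X, k, lr=0.1, epochs=5, seed=0):
    """SVRG-style k-means with a constant learning rate (illustrative)."""
    n = len(X)
    rng = np.random.default_rng(seed)
    C = X[rng.choice(n, k, replace=False)].astype(float)
    for _ in range(epochs):
        C_snap = C.copy()
        # assignments and per-center full gradients at the snapshot
        assign = ((X[:, None] - C_snap[None]) ** 2).sum(-1).argmin(1)
        mu = np.array([C_snap[j] - X[assign == j].mean(0)
                       if (assign == j).any() else np.zeros(X.shape[1])
                       for j in range(k)])
        for _ in range(n):                       # stochastic inner loop
            i = rng.integers(n)
            j = assign[i]
            g = C[j] - X[i]                      # gradient at current centers
            g_snap = C_snap[j] - X[i]            # same term at the snapshot
            C[j] -= lr * (g - g_snap + mu[j])    # variance-reduced step
    return C
```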
5. An approximate k-means method, LSH k-means, is proposed. Instead of building an index over the cluster centers, it builds a Locality-Sensitive Hashing (LSH) index over the sample points. Specifically, LSH k-means first builds an LSH hash table so that sample points close to each other are likely to be hashed into the same bucket. Each cluster center is then used as a query against the table to retrieve its potential neighbor sample points. LSH k-means next introduces an indicator matrix that converts the potential neighbor sample points of each center into potential neighbor centers of each sample point. Finally, guided by the indicator matrix, every sample point finds its nearest cluster center without computing distances to all centers. In addition, we propose an automatic tuning strategy that determines the hyperparameters of LSH k-means using two metrics built on the indicator matrix. Extensive experiments on three datasets show that the proposed algorithm converges well and achieves significant acceleration. This work is implemented serially on a workstation and, continuing from the perspective of efficient algorithm design, achieves high scalability with low resource occupancy.
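To make the candidate-pruning mechanism concrete, here is a minimal single-table sketch using random-hyperplane hashing. It illustrates the idea only and is not the thesis's implementation: multi-probe lookups, the actual indicator-matrix bookkeeping, and the automatic hyperparameter tuning are all omitted, and the names are hypothetical.

```python
import numpy as np

def lsh_assign(X, C, n_bits=12, seed=0):
    """Approximate nearest-center assignment in the spirit of LSH k-means."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((X.shape[1], n_bits))   # random hyperplanes
    key = lambda v: tuple(v @ H > 0)                # bucket signature
    table = {}                                      # LSH table over the points
    for i, x in enumerate(X):
        table.setdefault(key(x), []).append(i)
    cand = [[] for _ in range(len(X))]              # candidate centers per point
    for j, c in enumerate(C):                       # centers act as queries
        for i in table.get(key(c), []):
            cand[i].append(j)
    assign = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        js = cand[i] or range(len(C))               # fall back to a full scan
        assign[i] = min(js, key=lambda j: ((x - C[j]) ** 2).sum())
    return assign
```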
Keywords/Search Tags: Machine learning, Scalable, Parallel and distributed, Algorithm design, Stochastic gradient descent, Computing framework