Design And Implementation Of Machine Learning Platform Based On Spark

Posted on:2015-01-17

Degree:Master

Type:Thesis

Country:China

Candidate:Z K Tang

Full Text:PDF

GTID:2268330428961660

Subject:Computer technology

Abstract/Summary:

With the development of cloud computing and distributed cluster technologies, big data has grown widely and deeply in both volume and value, and machine learning, which plays an essential role in exploring big data, has attracted unprecedented attention in recent years. Traditional data mining algorithms are incapable of handling massive datasets. MapReduce has been applied successfully to many big data problems; however, it lacks the ability to efficiently support parallelized, iterative machine learning algorithms. To address these problems, we propose a machine learning platform based on the emerging Spark framework that not only processes massive data efficiently but also scales well, and can therefore meet the demands of many kinds of machine learning tasks.

The contributions of this thesis are as follows:

We develop a variety of machine learning algorithms based on Spark and the theory of large-scale machine learning, including parallelized linear regression, support vector machine, KMeans, matrix factorization and PageRank algorithms based on a graph computing model, and KMeans in the dataflow setting, to achieve high utility, scalability and efficiency.

Several strategies are used in the implementation of the platform to improve and optimize performance on large-scale datasets. For example, a Bagging strategy based on ensemble learning theory is adopted to improve the stability of the model, and sub-gradient optimization is used to improve the efficiency of model computation. A variant of the matrix factorization algorithm based on the graph computing framework is suited to the extremely sparse rating matrices found in massive datasets. In addition, we implement the algorithms with object-oriented design methods for extensibility; design patterns such as the Factory pattern and the Strategy pattern are encapsulated in the framework.

Following the design of the Lambda architecture, the platform is divided into three layers: the batch layer, the service layer and the dataflow layer. The batch layer is built on a hybrid of Spark and Hadoop to model batch datasets. The service layer constructs indexes of the batch model to support parallel real-time requests. The dataflow layer focuses on streaming computation to model real-time datasets. For each incoming request, the batch and dataflow results are combined into the final output.

The performance of the algorithms in the platform is verified by experimental results. Compared with serial algorithms on a single computer and with algorithms based on MapReduce, our methods show significant improvements in runtime, speedup ratio and throughput.
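
The abstract names sub-gradient optimization and a parallelized support vector machine but gives no implementation details. The following is only a minimal sketch of the general pattern, written in plain Python without Spark and not taken from the thesis. It assumes a linear SVM with L2-regularized hinge loss and a Pegasos-style decaying step size. Each data partition computes a local sub-gradient (the "map" step), and the local results are summed before the weights are updated (the "reduce" step). All function names and the toy data are hypothetical.

```python
# Illustrative sketch only: plain Python standing in for a cluster framework.
# Function names, parameters and data are hypothetical, not from the thesis.

def partition_subgradient(w, partition):
    """Sum of hinge-loss sub-gradients over one partition (the 'map' step)."""
    grad = [0.0] * len(w)
    for x, y in partition:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1.0:  # the point violates the margin, so the hinge is active
            for j, xj in enumerate(x):
                grad[j] -= y * xj
    return grad


def train_svm(partitions, dim, lam=0.01, steps=200):
    """Full-batch sub-gradient descent on the L2-regularized hinge loss."""
    n = sum(len(p) for p in partitions)
    w = [0.0] * dim
    for t in range(1, steps + 1):
        # 'map': every partition computes its local sub-gradient independently
        local = [partition_subgradient(w, p) for p in partitions]
        # 'reduce': sum the local sub-gradients element-wise
        total = [sum(g[j] for g in local) for j in range(dim)]
        eta = 1.0 / (lam * t)  # assumed Pegasos-style decaying step size
        w = [wi - eta * (lam * wi + tj / n) for wi, tj in zip(w, total)]
    return w


def predict(w, x):
    """Return the predicted label, +1 or -1, for feature vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Toy usage: two partitions; the second feature is a constant bias term.
partitions = [
    [((2.0, 1.0), 1), ((3.0, 1.0), 1)],
    [((-2.0, 1.0), -1), ((-3.0, 1.0), -1)],
]
w = train_svm(partitions, dim=2)
print(predict(w, (2.5, 1.0)), predict(w, (-2.5, 1.0)))
```

In a framework such as Spark, the per-partition step would run on the workers and only the small gradient vectors would travel to the driver in each iteration. Keeping the dataset in memory across iterations is the property that the abstract contrasts with MapReduce.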

Keywords/Search Tags:

Spark, Machine Learning, Massive Data Mining

Related items

1	A Frequent Serial Episode Mining Algorithm With Time Constraints Based On Spark Platform
2	Research And Implementation Of Unified Large Data Mining Service Platform Based On Spark MLlib
3	The Research Of High Efficient Data Mining Algorithms For Massive Data Sets
4	Design And Implementation Of The Massive Data Computing Platform Based On Spark
5	Research On SPARK Based Massive Data Frequent Pattern Mining Algorithms
6	Development Of Face Verification Algorithm On Massive Data Scenario Based On Metrix Learning And Distributed Machine Learning
7	Design And Implementation Of A Web Log Analytics Platform Based On Big Data And Machine Learning
8	Analysis And Research Of Machine Learning Model Based On Spark
9	Research On Reducing And Classifying Massive Data
10	Research Of Large-scale Data Mining Technology Based On Spark