
Design And Implementation Of Machine Learning Platform Based On Spark

Posted on: 2015-01-17  Degree: Master  Type: Thesis
Country: China  Candidate: Z K Tang  Full Text: PDF
GTID: 2268330428961660  Subject: Computer technology
Abstract/Summary:
With the development of cloud computing and distributed cluster technologies, the concept of big data has expanded widely in both volume and value, and machine learning, which plays an essential role in exploring big data, has attracted unprecedented attention in recent years. Traditional data mining algorithms are incapable of handling massive datasets. MapReduce has been successfully applied to many big data problems; however, it lacks the ability to efficiently support parallelized, iterative machine learning algorithms. To address these problems, we propose a machine learning platform based on the emerging Spark framework that not only processes massive data efficiently but also scales well, satisfying the demands of many kinds of machine learning tasks.

The contributions of this thesis are as follows:

We develop a variety of machine learning algorithms based on Spark and the theory of large-scale machine learning, including parallelized linear regression, support vector machines, and KMeans; matrix factorization and PageRank algorithms based on a graph computing model; and KMeans on dataflow, to achieve high utility, scalability, and efficiency.

Several strategies are used in the implementation of the platform to optimize performance on large-scale datasets. For example, a Bagging strategy based on ensemble learning theory is adopted to improve the stability of the model, and sub-gradient optimization is used to improve the efficiency of model computation. A variant of the matrix factorization algorithm based on the graph computing framework is suited to the extremely sparse rating matrices found in massive datasets. In addition, we implement the algorithms with object-oriented design methods for extensibility. Design patterns such as the Factory pattern and the Strategy pattern are encapsulated in the framework.

Following the design of the Lambda architecture, the platform is divided into three layers: the batch layer, the service layer, and the dataflow layer. The batch layer is built on a hybrid of Spark and Hadoop to model batch datasets. The service layer constructs indexes over the batch model to support parallel real-time requests. The dataflow layer focuses on streaming computation to model real-time datasets. Each incoming request combines the batch and dataflow results into the final output.

The performance of the algorithms in the platform is verified by experimental results. Compared with serial algorithms on a single computer and with algorithms based on MapReduce, our methods show significant improvements in runtime, speedup ratio, and throughput.
Keywords/Search Tags:Spark, Machine Learning, Massive Data Mining
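
The three-layer interaction described in the abstract, where a request combines batch and dataflow results, can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses plain Python rather than Spark, simple per-key counts stand in for a learned model, and the names `batch_view`, `DataflowView`, and `query` are hypothetical.

```python
from collections import Counter


def batch_view(events):
    """Batch layer (hypothetical): recompute a full view from the historical dataset.

    Counts per key stand in for a batch-trained model.
    """
    return Counter(events)


class DataflowView:
    """Dataflow layer (hypothetical): incrementally track events seen since the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        # Each streaming event updates the real-time view in place.
        self.counts[event] += 1


def query(key, batch, dataflow):
    """Service layer (hypothetical): merge the batch and dataflow results for one request."""
    return batch.get(key, 0) + dataflow.counts.get(key, 0)


# Historical data is processed once by the batch layer.
batch = batch_view(["spark", "hadoop", "spark"])

# Newer events arrive through the dataflow layer.
stream = DataflowView()
for event in ["spark", "storm"]:
    stream.update(event)

print(query("spark", batch, stream))  # 2 from the batch view + 1 from the stream = 3
```

In this sketch the batch view is only as fresh as its last recomputation, and the dataflow view covers the events that arrived afterwards, so a request sees both.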