
Matrix Model-Based Cross-Platform Big Data Machine Learning System And Its Performance Optimization

Posted on: 2018-03-18  Degree: Master  Type: Thesis
Country: China  Candidate: Z Q Liu  Full Text: PDF
GTID: 2428330512997990  Subject: Computer technology
Abstract/Summary:
Large-scale machine learning and data analysis systems have become a hot research topic, driven by the need to efficiently extract the hidden value of big data. An excellent large-scale machine learning system should deliver high data-processing performance, provide programming abstractions for efficiently designing machine learning algorithms, and support both existing and future big data platforms. Matrix operations are widely used in machine learning and data analysis algorithms. Beyond traditional single-machine matrix-based platforms such as R and MATLAB, there has been considerable research on the design and implementation of high-performance distributed matrix computing libraries, such as HAMA, ML-Matrix, and Marlin. However, these libraries only provide basic matrix operations and lack a global optimization scheme. Moreover, the performance of matrix operations differs across platforms, with each platform's advantages and disadvantages depending on its characteristics, the computing logic, and the matrix size. It is therefore very challenging for users such as data scientists to choose the platform, or combination of platforms, that achieves the best performance for a given algorithm workflow. To address these usability and performance issues, spanning the programming model and framework, the optimization of the computing flow graph, and the overall system design, we design and implement a cross-platform large-scale machine learning system called Octopus, based on a matrix programming model. The primary contributions of this thesis are as follows:

(1) We implement a cross-platform large-scale machine learning system based on the matrix model. The system allows users to implement big data machine learning and data analysis algorithms with the matrix model in the R language, achieving good usability and programmability.

(2) To improve performance, we build the computing flow graph through declarative matrix construction and implement logical optimizations on the graph, including common subexpression elimination and matrix multiplication chain optimization (a sketch of the chain-ordering technique follows the contribution list).

(3) We design and implement physical optimizations of the computing flow graph to improve performance, including cache and shuffle optimizations on the Spark platform and automatic selection of the best platform for each matrix operation when multiple platforms are available.

(4) We implement the prototype system, Octopus, based on the above key technologies. Octopus supports computing platforms such as R, Spark, Hadoop, and MPI, provides declarative and imperative matrix interfaces in the R language, and achieves good usability and programmability. Users without distributed programming knowledge can design machine learning algorithms. An application needs to be written only once and, with almost no modification, can run on any supported computing platform according to actual demand, achieving the "Write Once, Run Anywhere" cross-platform characteristic.

(5) Experiments show that the logical optimizations of the Gaussian non-negative matrix factorization (GNMF) algorithm achieve speedups of 1.91x, 1.31x, and 1.23x on the R, Spark, and MPI platforms, respectively. The physical optimizations on the Spark platform achieve speedups of 1.58x to 5.06x for the GNMF algorithm. The time model of the automatic scheduling framework has an error rate of less than 10%. The automatic scheduling framework schedules the example application across the Spark and MPI platforms, improving performance by about 91% and 62% compared with running on Spark alone and on MPI alone, respectively.
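To make the matrix multiplication chain optimization in contribution (2) concrete, the following R sketch shows the classic dynamic-programming ordering of a matrix product chain. It is a minimal illustration of the technique under stated assumptions, not code from Octopus itself; the function name matrix_chain_order and the dims convention (matrix i in the chain has dimensions dims[i] x dims[i+1]) are introduced here only for the example.

# Minimal sketch of matrix multiplication chain ordering (not Octopus code).
# dims: matrix i in the chain has dimensions dims[i] x dims[i+1].
matrix_chain_order <- function(dims) {
  n <- length(dims) - 1
  cost <- matrix(0, n, n)      # cost[i, j]: minimal scalar multiplications for chain i..j
  best_k <- matrix(0L, n, n)   # best_k[i, j]: split point achieving that minimum
  if (n >= 2) {
    for (len in 2:n) {                     # length of the sub-chain
      for (i in 1:(n - len + 1)) {
        j <- i + len - 1
        cost[i, j] <- Inf
        for (k in i:(j - 1)) {             # try every split (i..k)(k+1..j)
          q <- cost[i, k] + cost[k + 1, j] + dims[i] * dims[k + 1] * dims[j + 1]
          if (q < cost[i, j]) {
            cost[i, j] <- q
            best_k[i, j] <- k
          }
        }
      }
    }
  }
  list(cost = cost, best_k = best_k)
}

# Example: A1 (1000 x 10), A2 (10 x 1000), A3 (1000 x 10).
# A1 %*% (A2 %*% A3) needs 2e5 scalar multiplications, while
# (A1 %*% A2) %*% A3 needs 2e7, so the optimizer splits after A1.
res <- matrix_chain_order(c(1000, 10, 1000, 10))
res$cost[1, 3]    # 2e5
res$best_k[1, 3]  # 1

Applying such an ordering before generating the physical plan avoids the large intermediate matrices that a naive left-to-right evaluation would create, which is the kind of saving the logical optimization described above targets.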
Keywords/Search Tags: matrix computing model, distributed machine learning system, cross-platform, matrix computing flow graph optimization, automatic scheduling