
Matrix Model-Based Cross-Platform Big Data Machine Learning System And Its Performance Optimization

Posted on: 2018-03-18  Degree: Master  Type: Thesis
Country: China  Candidate: Z Q Liu  Full Text: PDF
GTID: 2428330512997990  Subject: Computer technology
Abstract/Summary:
Large-scale machine learning and data analysis systems have become a hot research topic, driven by the need to efficiently extract the hidden value of big data. An excellent large-scale machine learning system should deliver high data-processing performance, provide programming abstractions for efficiently designing machine learning algorithms, and support both existing and future big data platforms. Matrix operations are widely used in machine learning and data analysis algorithms. Beyond traditional single-machine matrix-based platforms such as R and MATLAB, there has been considerable research on the design and implementation of high-performance distributed matrix computing libraries, such as HAMA, ML-Matrix, and Marlin. However, these libraries only provide basic matrix operations and lack a global optimization scheme. Moreover, the performance of matrix operations differs across platforms, with each platform's advantages and disadvantages depending on its characteristics, the computing logic, and the matrix size. It is therefore very challenging for users such as data scientists to choose the platform, or combination of platforms, that achieves the best performance for a given algorithm workflow. To address these usability and performance issues, spanning the programming model and framework, the optimization of the computing flow graph, and the overall system design, we design and implement a cross-platform large-scale machine learning system called Octopus, based on a matrix programming model. The primary contributions of this thesis are as follows:

(1) We implement a cross-platform large-scale machine learning system based on the matrix model. The system allows users to implement big data machine learning and data analysis algorithms with the matrix model in the R language, achieving good usability and programmability.

(2) To improve performance, we build the computing flow graph through declarative matrix construction and implement logical optimizations on the graph, including common subexpression elimination and matrix multiplication chain optimization (a sketch of the chain-ordering technique follows the contribution list).

(3) We design and implement physical optimizations of the computing flow graph to improve performance, including cache and shuffle optimizations on the Spark platform and automatic selection of the best platform for each matrix operation when multiple platforms are available.

(4) We implement the prototype system, Octopus, based on the above key technologies. Octopus supports computing platforms such as R, Spark, Hadoop, and MPI, provides declarative and imperative matrix interfaces in the R language, and achieves good usability and programmability. Users without distributed programming knowledge can design machine learning algorithms. An application needs to be written only once and, with almost no modification, can run on any supported computing platform according to actual demand, achieving the "Write Once, Run Anywhere" cross-platform characteristic.

(5) Experiments show that the logical optimizations of the Gaussian non-negative matrix factorization (GNMF) algorithm achieve speedups of 1.91x, 1.31x, and 1.23x on the R, Spark, and MPI platforms, respectively. The physical optimizations on the Spark platform achieve speedups of 1.58x to 5.06x for the GNMF algorithm. The time model of the automatic scheduling framework has an error rate of less than 10%. The automatic scheduling framework schedules the example application across the Spark and MPI platforms, improving performance by about 91% and 62% compared with running on Spark alone and on MPI alone, respectively.
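To make the matrix multiplication chain optimization in contribution (2) concrete, the following R sketch shows the classic dynamic-programming ordering of a matrix product chain. It is a minimal illustration of the technique under stated assumptions, not code from Octopus itself; the function name matrix_chain_order and the dims convention (matrix i in the chain has dimensions dims[i] x dims[i+1]) are introduced here only for the example.

# Minimal sketch of matrix multiplication chain ordering (not Octopus code).
# dims: matrix i in the chain has dimensions dims[i] x dims[i+1].
matrix_chain_order <- function(dims) {
  n <- length(dims) - 1
  cost <- matrix(0, n, n)      # cost[i, j]: minimal scalar multiplications for chain i..j
  best_k <- matrix(0L, n, n)   # best_k[i, j]: split point achieving that minimum
  if (n >= 2) {
    for (len in 2:n) {                     # length of the sub-chain
      for (i in 1:(n - len + 1)) {
        j <- i + len - 1
        cost[i, j] <- Inf
        for (k in i:(j - 1)) {             # try every split (i..k)(k+1..j)
          q <- cost[i, k] + cost[k + 1, j] + dims[i] * dims[k + 1] * dims[j + 1]
          if (q < cost[i, j]) {
            cost[i, j] <- q
            best_k[i, j] <- k
          }
        }
      }
    }
  }
  list(cost = cost, best_k = best_k)
}

# Example: A1 (1000 x 10), A2 (10 x 1000), A3 (1000 x 10).
# A1 %*% (A2 %*% A3) needs 2e5 scalar multiplications, while
# (A1 %*% A2) %*% A3 needs 2e7, so the optimizer splits after A1.
res <- matrix_chain_order(c(1000, 10, 1000, 10))
res$cost[1, 3]    # 2e5
res$best_k[1, 3]  # 1

Applying such an ordering before generating the physical plan avoids the large intermediate matrices that a naive left-to-right evaluation would create, which is the kind of saving the logical optimization described above targets.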
Keywords/Search Tags: matrix computing model, distributed machine learning system, cross-platform, matrix computing flow graph optimization, automatic scheduling