
Study And Implementation On Distributed Large Scale Matrix Computation Algorithms With Spark

Posted on: 2017-04-21    Degree: Master    Type: Thesis
Country: China    Candidate: Y Tang    Full Text: PDF
GTID: 2308330485460892    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development and wide application of the modern Internet, a large amount of valuable information has emerged, signaling the arrival of the Big Data era. How to extract useful information from such massive data has become a problem of great significance. Machine learning plays an essential role in extracting value from big data, and machine learning algorithms are therefore major applications on many distributed data-parallel platforms, where matrix computation fundamentally determines efficiency and performance. Since large-scale matrix computation cannot be completed in acceptable time on a single node, there is an increasing demand to seamlessly integrate it into general-purpose distributed data-parallel computing systems, such as MapReduce and Spark, which provide scalability and fault tolerance. To address this problem, this thesis proposes an efficient distributed matrix computation system built on Spark.

Firstly, we focus on computations over dense matrices, especially matrix multiplication and element-wise operators. Building on existing matrix multiplication execution strategies in distributed data-flow systems, we propose a novel strategy named C-RMM (Concurrent Replication based Matrix Multiplication), which balances computation parallelism against shuffle data size better than existing strategies. Moreover, we present three critical optimizations: Batch Calling Native Library (BCNL), which increases the concurrency of data exchange in the MapMM (Map-side Matrix Multiplication) strategy; Slicing Matrix Transformation (SMT), which reduces the overhead of block matrix transformation; and Shuffle-Light sub-Matrix co-Grouping (SLMG), which halves the disk-heavy shuffle operations by exploiting the semantics of matrix computation. Additionally, we use a co-partition mechanism to significantly improve the performance of element-wise operators.

Secondly, since matrices in many real applications are sparse, and dense storage formats and computation can incur unnecessary overhead in such cases, we investigate computations over sparse matrices. After analyzing the differences between highly sparse and moderately sparse matrices and the scenarios each suits, we compare state-of-the-art JVM-based local matrix computation libraries with our own local sparse matrix library. The results reveal that our library achieves better performance for distributed sparse matrix computation.

Based on the above work, we have implemented an efficient distributed matrix computation system, Marlin, as a set of complete and easy-to-use matrix APIs on the widely used Spark system. The experimental results show that, for large dense-dense matrix multiplication, Marlin outperforms the alternative systems Spark MLlib, SystemML, and SciDB by about 2-4 times. Moreover, for large sparse-sparse matrix multiplication at commonly observed densities, it achieves an 8-16 times speedup; for large dense-sparse matrix multiplication, it achieves a 3-5 times speedup.

Lastly, to evaluate the overall efficiency of our implementation, we build two real machine learning algorithms on top of Marlin. Evaluation on a realistic DNN workload indicates that Marlin outperforms SystemML and SciDB by about 5-29 times, and a realistic GNMF workload shows that Marlin outperforms SystemML and SciDB by 2-3 times. Besides, the experimental results also show that Marlin achieves good scalability.
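To make the block-based distributed matrix multiplication discussed above concrete, the following is a minimal Scala sketch of the baseline replication-based (RMM-style) strategy on Spark RDDs that C-RMM improves upon. It is illustrative only: the object name BlockMatMulSketch, the Block type, and the multiply and localGemm helpers are hypothetical and do not reflect Marlin's actual API or the C-RMM strategy itself; a real system would also call a native BLAS for the local products, as the BCNL optimization suggests.

```scala
import org.apache.spark.rdd.RDD

// Sketch of a replication-based block matrix multiplication on Spark RDDs.
// All names here are illustrative, not Marlin's API.
object BlockMatMulSketch {
  // A block is addressed by its (rowBlock, colBlock) index and stores a
  // dense row-major array of blockSize * blockSize doubles.
  type Block = ((Int, Int), Array[Double])

  def multiply(a: RDD[Block], b: RDD[Block],
               mBlocks: Int, nBlocks: Int, blockSize: Int): RDD[Block] = {
    // Replicate each A block across the nBlocks result-column groups and each
    // B block across the mBlocks result-row groups, so that every output
    // block (i, j) receives all the partial products it needs.
    val aByDest = a.flatMap { case ((i, k), blk) =>
      (0 until nBlocks).map(j => ((i, j, k), blk))
    }
    val bByDest = b.flatMap { case ((k, j), blk) =>
      (0 until mBlocks).map(i => ((i, j, k), blk))
    }

    // Join matching sub-matrices on (i, j, k), multiply them locally,
    // then sum the partial products per output block (i, j).
    aByDest.join(bByDest)
      .map { case ((i, j, _), (aBlk, bBlk)) =>
        ((i, j), localGemm(aBlk, bBlk, blockSize))
      }
      .reduceByKey { (x, y) =>
        val out = new Array[Double](x.length)
        var t = 0
        while (t < x.length) { out(t) = x(t) + y(t); t += 1 }
        out
      }
  }

  // Plain triple-loop GEMM on row-major square blocks; shown only so the
  // sketch is self-contained.
  def localGemm(a: Array[Double], b: Array[Double], n: Int): Array[Double] = {
    val c = new Array[Double](n * n)
    var i = 0
    while (i < n) {
      var k = 0
      while (k < n) {
        val aik = a(i * n + k)
        var j = 0
        while (j < n) { c(i * n + j) += aik * b(k * n + j); j += 1 }
        k += 1
      }
      i += 1
    }
    c
  }
}
```

The shuffle cost of this baseline grows with the replication factors mBlocks and nBlocks, which is the trade-off between parallelism and shuffle size that the thesis's C-RMM strategy is designed to balance more effectively.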
Keywords/Search Tags:Big Data, Distributed Dataflow System, Matrix Computation, Parallel Algorithms, Machine Learning