Research On Distributed Matrix Factorization Algorithm Based On Hadoop

Posted on: 2019-06-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2438330548972616
Subject: Computer Science and Technology
Abstract/Summary:
In the past few years, the rapid development of science and technology has led to explosive growth in data, and we have entered the era of big data. Nonnegative matrix factorization (NMF) is an efficient method widely used for dimensionality reduction and feature extraction. It can effectively reduce the complexity of large-scale data and reveal the value of the data, but it is computationally expensive. Hadoop, a distributed computing platform, is active in the field of big data. It has been combined with many algorithms to form new data processing modes that further improve the efficiency of data operations. Hadoop uses HDFS as its file system to store data and, combined with the MapReduce programming model, achieves remarkable results in the parallel processing of large-scale data.

In this thesis, first, building on a study of various distributed platforms, we consider the possibility of running NMF on other distributed platforms by studying an NMF algorithm implemented with hybrid MPI and OpenMP programming. After studying the Hadoop platform, we compare the advantages and disadvantages of various NMF algorithms and matrix multiplication modes. We then propose a new NMF algorithm, HNMF, which combines Hadoop with NMF: it uses Hadoop's capability for parallel processing of large-scale data together with the data-reduction characteristic of NMF itself to achieve a higher speedup. In this way, the iterative update problem of NMF is completed efficiently, improving the computational efficiency of the algorithm. A comparison of execution time and speedup on matrices of similar scale against the NMF algorithm based on hybrid MPI and OpenMP programming demonstrates the feasibility of HNMF and its higher speedup.

Second, through a study of the classical NMF matrix update rules, according to the calculation method of matrix multiplication, and based on the two proposed NMF algorithms, CNMF and TNMF, this thesis analyzes the overhead of the matrix update phases and proposes an optimization scheme. Experiments on Hadoop validate how the speedup changes when computing nodes are added at certain MapReduce phases, and reveal how the execution time changes as the number of nonnegative elements in the matrix increases.
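
The abstract does not reproduce the update rules or the job layouts of HNMF, CNMF, or TNMF. As a rough illustration only, the sketch below assumes that the "classical" rules are the Lee-Seung multiplicative updates for the Frobenius objective, H <- H * (W^T V) / (W^T W H) and W <- W * (V H^T) / (W H H^T), and it simulates a MapReduce-style matrix multiplication in a single Python process instead of on Hadoop. All function names (map_phase, reduce_phase, mr_matmul, nmf_step, and so on) are hypothetical and are not taken from the thesis.

```python
import random
from collections import defaultdict


def map_phase(A, B):
    """Map step: emit ((i, j), A[i][k] * B[k][j]) for every nonzero A[i][k]."""
    for i, row in enumerate(A):
        for k, a in enumerate(row):
            if a == 0:
                continue  # zero entries produce no intermediate records
            for j, b in enumerate(B[k]):
                yield (i, j), a * b


def reduce_phase(pairs, n_rows, n_cols):
    """Reduce step: sum the partial products that share an output cell (i, j)."""
    sums = defaultdict(float)
    for key, value in pairs:
        sums[key] += value
    return [[sums[(i, j)] for j in range(n_cols)] for i in range(n_rows)]


def mr_matmul(A, B):
    """One simulated MapReduce job computing the product A * B."""
    return reduce_phase(map_phase(A, B), len(A), len(B[0]))


def transpose(M):
    return [list(col) for col in zip(*M)]


def scale(X, num, den, eps=1e-9):
    """Element-wise X * num / (den + eps); eps guards against division by zero."""
    return [[x * n / (d + eps) for x, n, d in zip(xr, nr, dr)]
            for xr, nr, dr in zip(X, num, den)]


def nmf_step(V, W, H):
    """One iteration of the assumed Lee-Seung multiplicative updates."""
    Wt = transpose(W)
    H = scale(H, mr_matmul(Wt, V), mr_matmul(mr_matmul(Wt, W), H))
    Ht = transpose(H)
    W = scale(W, mr_matmul(V, Ht), mr_matmul(W, mr_matmul(H, Ht)))
    return W, H


def frobenius_error(V, W, H):
    """Frobenius norm of V - W * H."""
    WH = mr_matmul(W, H)
    return sum((v - x) ** 2 for vr, xr in zip(V, WH) for v, x in zip(vr, xr)) ** 0.5


def random_matrix(rows, cols, rng):
    return [[rng.uniform(0.1, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Build a nonnegative 6 x 5 matrix of rank at most 2, then factor it.
    V = mr_matmul(random_matrix(6, 2, rng), random_matrix(2, 5, rng))
    W, H = random_matrix(6, 2, rng), random_matrix(2, 5, rng)
    print("initial error:", frobenius_error(V, W, H))
    for _ in range(50):
        W, H = nmf_step(V, W, H)
    print("error after 50 steps:", frobenius_error(V, W, H))
```

In this sketch every call to mr_matmul stands in for one MapReduce job, so a single iteration costs six multiplications. The map step skips zero entries, which may be one reason the number of intermediate records, and with it the running time, depends on how many entries of the matrix are populated.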
Keywords/Search Tags: Nonnegative matrix factorization, big data, Hadoop, parallel, distributed platform, matrix multiplication