Font Size: a A A

Research On Parallel Manifold Learning Methods Based On Spark

Posted on:2016-05-17Degree:MasterType:Thesis
Country:ChinaCandidate:B YuanFull Text:PDF
GTID:2348330536986838Subject:Engineering
Abstract/Summary:PDF Full Text Request
In recent years,it is becoming more and more difficult to analyze data with the increasing data volume and data dimension.The research of big data field is widely concerned from academia and industry.Big data is the future trend of information technology development.In the environment with 100 billion big data,on the one hand the linear dimension reduction algorithms represented by parallel PCA cannot well preserve the topological structure of the original data and find the nonlinear features of data.On the other hand,the non-linear methods are the serial algorithms under the single machine,which cannot meet requirements of big data processing.For the above problems,some problems in the parallel manifold learning methods are discussed based on the distributed memory computing platform Spark and the existing serial manifold learning methods from the view of big data dimensionality reduction.The main contents and innovations in the paper are as follows:(1)The computing performance of the parallel manifold learning methods in MapReduce and Spark frame is analyzed and compared.Experimental results show that the performance of manifold learning algorithms based on Spark is superior to that based on MapReduce.(2)It will take longer time to solve the neighborhood matrix on big data set.To overcome the problem,a parallel algorithm to search neighborhood,which is based on p-stable distribution and the locality sensitive hashing method,is designed and implemented.The candidate neighbors are selected through clustering neighbor points according to the characteristics of the locality sensitive hashing function before calculating neighborhoods.Neighbors are determined through linear search from the candidate neighbors.Under guarantying accuracy,serial hashing functions are used and data are merged through Shuffle operations to improve the accuracy of neighborhood search.Experimental results demonstrate that the designed algorithm great improve the efficiency of searching neighborhoods.(3)To quickly solve the maximum eigenvalues and the corresponding eigenvectors of the matrix,a parallel eigenvalue solving method based on Spark is realized,which utilizes the power method and the order reduction method in turn to solve the eigenvalues and the eigenvectors.In the algorithm,the computing matrix is explicitly persisted into memory through fully using the caching features of Spark.The data in memory are directly used,which saves disk IO time.(4)To further improve the performance of the parallel algorithms,the calculating process is optimized based on the features of Spark.The memory consumption of computing the neighborhood matrix is reduced with the maximum heap data structure to select neighbors.The memory consumption is also decreased through using the sparse vectors to store the sparse matrix.Data transmissions are reduced through utilizing thebroadcast mechanism to transmit shared datasets.The iterative procedure speeds up with the caching mechanism.Experimental results on Swiss roll data demonstrate that the presented parallel manifold learning method can reduce the dimension of nonlinear big data and discover the nonlinear characteristics of the high-dimensional data,which has better performance.
Keywords/Search Tags:Manifold learning, Big data, Spark, LSH, P-stable distribution
PDF Full Text Request
Related items