
The Optimization Research Of Spark Memory Allocation And K-means Algorithm

Posted on: 2021-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: S S Geng
Full Text: PDF
GTID: 2428330620963594
Subject: Computer application technology
Abstract/Summary:
Spark is a memory-based distributed data processing framework that is widely used in data processing and analysis, machine learning, and related fields. Optimizing the Spark platform has therefore become an active research topic. Although processing data on Spark can improve execution efficiency and reduce data transmission time, its computing performance is easily affected by many factors, such as the underlying hardware, system architecture, operating system, and applications. These factors lead to low memory utilization under Spark's memory allocation and to low accuracy in the Spark MLlib K-means clustering algorithm. This thesis therefore studies and improves Spark memory allocation and the K-means algorithm. The main research contents are as follows.

(1) Research on optimization of Spark memory allocation. To address the unfair memory allocation and low memory utilization caused by differences among tasks on the Spark platform, this thesis proposes an optimized memory allocation method. Because Spark has used dynamic memory allocation since version 1.6, the allocation scheme is optimized in two parts. First, in the storage area, four features of RDD partitions are selected; during cache replacement, PCA dimension reduction selects only the two most important features each time, which helps the optimized cache replacement strategy generalize. Second, in the execution area, the memory allocation strategy is optimized according to the memory needs of each Task and the memory space of the storage area. Finally, experiments in Spark Standalone mode verify the effectiveness of the optimization strategy; they show that the improved memory allocation strategy improves cluster performance and task execution speed.

(2) Research on K-means algorithm optimization in Spark. The clustering quality of the traditional K-means algorithm depends on the choice of initial cluster centers, and the algorithm cannot meet the practical needs of massive data processing. To address this problem, the thesis improves the K-means algorithm on the Spark platform and parallelizes it. First, the idea of the maximum and minimum algorithm is used to improve the value of the truncation distance in the density peak clustering (DPC) algorithm, so that the data are preprocessed and the density peak points, namely the initial cluster centers, are found. Then the objective function is adjusted to minimize the intra-class distance and maximize the inter-class distance. Finally, the improved K-means algorithm and its parallelization are implemented on the Spark platform by modifying the source code. Several data sets from the UCI database are used to verify the optimized algorithm. The experimental results show that it reduces the number of iterations and the running time of the algorithm, and improves clustering accuracy.
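
The idea of using density peak points as initial K-means centers can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation: the function name `density_peak_centers`, the fixed truncation distance `d_c` passed in as a parameter (rather than derived from the maximum and minimum algorithm), and the rule of ranking points by the product of local density and separation distance are all assumptions made for illustration, and nothing here is parallelized on Spark.

```python
import math


def density_peak_centers(points, k, d_c):
    """Return k candidate initial centers chosen as density peaks.

    points: list of equal-length numeric tuples
    k:      number of centers to pick (hypothetical parameter)
    d_c:    truncation distance, assumed to be given here
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho_i: how many other points lie closer than d_c.
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
        for i in range(n)
    ]

    # delta_i: distance to the nearest point of higher density.
    # Ties in density are broken by index so that only one point
    # ends up with no "higher" neighbour.
    delta = []
    for i in range(n):
        higher = [
            dist[i][j]
            for j in range(n)
            if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)
        ]
        # The densest point has no higher neighbour; give it the
        # largest distance to any point instead.
        delta.append(min(higher) if higher else max(dist[i]))

    # Rank by rho * delta (an assumed selection rule): density peaks
    # are both dense and far from any denser point.
    ranked = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    return [points[i] for i in ranked[:k]]


if __name__ == "__main__":
    sample = [
        (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
        (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
    ]
    print(density_peak_centers(sample, k=2, d_c=0.8))
```

On the two well-separated toy clusters in `sample`, the middle point of each cluster is selected, and these points could then be passed to any K-means routine as its starting centers in place of random initialization.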
Keywords/Search Tags:Spark, Memory allocation, Cache replacement, K-means algorithm, Density peak algorithm