
Research On Spark Performance Optimization Technology For In-Memory Computing

Posted on: 2021-03-23; Degree: Master; Type: Thesis
Country: China; Candidate: C Liu; Full Text: PDF
GTID: 2428330614958437; Subject: Computer technology
Abstract/Summary:
The big data platform Spark has become a research hotspot in recent years. Its in-memory computing model delivers excellent performance in iterative workloads such as machine learning and neural networks, and it is used by many companies, including Baidu, Meituan, Tencent and Alibaba. The largest Spark clusters have thousands of nodes and terabytes of memory, and process data at petabyte scale. However, the nodes in a Spark cluster are often highly heterogeneous because of geographical differences, configuration updates and cluster expansion. How to make better use of cluster resources to improve platform performance has therefore become a major research direction. This thesis studies and improves Spark performance optimization from two aspects: task scheduling and cache replacement. The work is divided into two parts: a task scheduling mechanism based on an improved quantum ant colony algorithm, and a cache replacement and preload mechanism based on RDD weights and dual queues. The two parts are described below.

1. A task scheduling mechanism based on an improved quantum ant colony algorithm. Spark's default task scheduling mechanism cannot fully exploit the hardware advantages of the high-performance nodes in a cluster, which leads to unbalanced task allocation and frequent memory spills. The thesis first designs a method for measuring task completion time in a heterogeneous cluster. The method jointly considers CPU performance, memory capacity, CPU utilization, memory utilization and network transmission speed, and uses the likelihood of a memory spill to compute a memory spill indicator. The thesis then improves the quantum ant colony algorithm: it uses the task completion time measurement to evaluate each individual's fitness, applies a max-min quantum pheromone update rule to bound the range of the quantum pheromone probability amplitude, and uses a dynamic catastrophe strategy to avoid plateaus (search stagnation). Experimental results show that the proposed task scheduling mechanism effectively improves system performance: compared with existing improved algorithms, it reduces task completion time by 10.9% and the number of memory spills by 17.9%.

2. The RDD Weights and Dual Queues based Cache Replacement and Preload mechanism (WDQCRP). Spark uses the LRU algorithm as its default cache replacement mechanism. When evicting data blocks, however, LRU considers only how recently an RDD was accessed, so it may evict highly important RDDs and incur a large recomputation overhead. To address this problem, researchers have proposed weight-based replacement algorithms, but their weight calculations are not comprehensive enough and therefore produce inaccurate weights. This thesis proposes a Load Prediction based Weight model (LPW), which considers the number of times an RDD is reused, the size of the RDD partition, the computation cost of the RDD partition, the expected life cycle of the RDD and a load prediction of the RDD's reuse time. The WDQCRP mechanism is designed on the basis of these RDD weights and the idea of dual queues. When memory is insufficient, WDQCRP determines which RDD is replaced in memory and whether that RDD should be cached on disk. To avoid disk I/O latency, WDQCRP preloads RDDs into memory when memory is sufficient. Experimental results show that the proposed WDQCRP mechanism effectively optimizes Spark performance: compared with existing improved algorithms, it reduces task completion time by 8.02% and increases the RDD access hit ratio by 9.59%.
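The abstract does not give the max-min quantum pheromone update rule itself. As a minimal sketch of the general idea of bounding a probability amplitude, the Python function below clamps one qubit's amplitudes so that neither choice probability can collapse to 0 or 1. The function name `clamp_amplitude`, the bound `eps` and the reading of `alpha**2` as the probability of picking a node are all assumptions made for illustration, not the thesis's definitions.

```python
import math


def clamp_amplitude(alpha, beta, eps=0.1):
    """Bound a qubit's amplitudes (hypothetical max-min style clamp).

    alpha**2 is read here as the probability of assigning a task to a node.
    eps is an assumed lower bound on |alpha| and |beta|; the matching upper
    bound is sqrt(1 - eps**2).
    """
    upper = math.sqrt(1.0 - eps * eps)
    # Clamp the magnitude of alpha into [eps, upper].
    a = min(max(abs(alpha), eps), upper)
    # Recompute beta from the clamped alpha so that a**2 + b**2 == 1 holds.
    b = math.sqrt(1.0 - a * a)
    # Keep the original signs of both amplitudes.
    return math.copysign(a, alpha), math.copysign(b, beta)
```

Bounding the amplitudes this way keeps every node selectable with non-zero probability, which is one common way to slow premature convergence in ant colony variants; the thesis may use a different rule.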
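The difference between recency-based and weight-based eviction can be illustrated with a small Python sketch. The weight used here (expected reuses times recomputation cost, divided by size) is a hypothetical stand-in that touches only three of the factors named above; it is not the LPW formula, which the abstract does not state. The class and function names are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class CachedPartition:
    name: str
    size_mb: float         # memory footprint of the partition
    compute_cost_s: float  # time needed to recompute it from lineage
    reuse_count: int       # expected number of future reuses
    last_access: int       # logical timestamp of the most recent access


def lru_victim(parts):
    """LRU: evict the partition that was accessed longest ago."""
    return min(parts, key=lambda p: p.last_access)


def weight(p):
    """Hypothetical weight: recomputation time saved per MB of memory held."""
    return p.reuse_count * p.compute_cost_s / p.size_mb


def weighted_victim(parts):
    """Weight-based: evict the partition with the lowest weight."""
    return min(parts, key=weight)


parts = [
    # Expensive and reused often, but not touched recently.
    CachedPartition("rdd_a", size_mb=100, compute_cost_s=50.0,
                    reuse_count=4, last_access=1),
    # Cheap and rarely reused, but touched recently.
    CachedPartition("rdd_b", size_mb=100, compute_cost_s=2.0,
                    reuse_count=1, last_access=5),
]

print(lru_victim(parts).name)       # rdd_a
print(weighted_victim(parts).name)  # rdd_b
```

In this toy case LRU evicts `rdd_a`, the partition that is costly to recompute, while the weight-based policy evicts the cheap `rdd_b` instead, which is the kind of recomputation overhead the abstract attributes to LRU.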
Keywords/Search Tags:Spark, quantum ant colony algorithm, task scheduling, cache replacement, preload