
Research On Job Scheduling And Memory Cache Optimization Based On SPARK

Posted on: 2020-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Zhang
Full Text: PDF
GTID: 2428330575975782
Subject: Computer software and theory

Abstract/Summary:
With the rapid development of cloud computing and big data technology, Spark, a large-scale data processing framework based on in-memory computing, has been widely adopted, and research on improving its task execution efficiency has become a hot topic. In Spark, data caching, reading, and computation are all carried out in memory, which greatly reduces the time spent transferring data between memory and disk and improves task execution efficiency. However, to further improve Spark's computational performance, two open problems remain: designing efficient job scheduling algorithms and making more efficient use of memory resources. This paper therefore studies improvements to both the job scheduling algorithm and the memory usage mechanism of the Spark platform. The main contributions are as follows:

(1) Spark job scheduling based on a genetic and tabu algorithm. This paper adopts the Spark On Yarn deployment mode and proposes a new job scheduling scheme to address the shortcomings of several scheduling algorithms available in Yarn. By studying the evolution of the genetic algorithm's population, we propose an improved optimal-preservation (elitist) strategy and a Modified Adaptive Genetic Algorithm (MAGA) for the crossover and mutation operations. By merging MAGA with a tabu search algorithm, we then obtain a modified adaptive genetic tabu algorithm for Spark job scheduling (a minimal sketch of this hybrid appears below). Experiments show that this scheduling algorithm effectively reduces task execution time and improves task performance.

(2) Research on and improvement of Spark memory cache management. The RDD is Spark's core abstract data model. Targeting the selection of RDDs to cache and the improvement of the LRU replacement algorithm, this paper proposes an RDD cache prediction mechanism, together with a weight model and weight-update mechanism based on RDD partition characteristics that uses the entropy method (also sketched below), in order to optimize memory utilization.

Finally, we build a Hadoop and Spark cluster environment and use the Spark On Yarn deployment to evaluate the two improvements. First, for the modified adaptive genetic tabu algorithm, after verifying its effectiveness in a simulation environment, we further verify in the cluster environment that the job scheduling algorithm effectively reduces task execution time and improves task execution efficiency. Then, the validity of the RDD cache prediction mechanism and the optimized replacement algorithm is verified in the same experimental environment. The experimental results show that these methods effectively reduce task execution time and improve memory utilization.
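As an illustration of contribution (1), the following is a minimal, self-contained Scala sketch of an adaptive genetic algorithm with elitist preservation and a tabu list, applied to assigning tasks to executors. Every name here (Chromosome, the fitness model, the rate constants, the task-cost array) is an illustrative assumption, not the thesis's actual MAGA implementation; the adaptive rates follow the common Srinivas-Patnaik scheme, in which above-average individuals receive smaller crossover and mutation probabilities.

    import scala.util.Random

    // Hedged sketch: adaptive GA + tabu list for task-to-executor assignment.
    // All constants and the cost model are illustrative assumptions.
    object MagaTabuSketch {
      val rand = new Random(42)
      val numTasks = 20        // tasks to schedule
      val numExecutors = 4     // executors (Yarn containers)
      // Assumed per-task costs; a real scheduler would estimate these.
      val taskCost: Array[Double] = Array.fill(numTasks)(1.0 + rand.nextDouble() * 9.0)

      type Chromosome = Array[Int] // chromosome(i) = executor assigned to task i

      // Fitness: negative makespan (max executor load), so higher is better.
      def fitness(c: Chromosome): Double = {
        val load = new Array[Double](numExecutors)
        for (i <- c.indices) load(c(i)) += taskCost(i)
        -load.max
      }

      // Adaptive rate: individuals at or above average fitness get a rate
      // scaled down toward zero as they approach the best individual.
      def adaptiveRate(f: Double, fAvg: Double, fMax: Double,
                       kHigh: Double, kLow: Double): Double =
        if (f >= fAvg && fMax != fAvg) kHigh * (fMax - f) / (fMax - fAvg) else kLow

      // One-point crossover between two assignments.
      def crossover(a: Chromosome, b: Chromosome): Chromosome = {
        val cut = rand.nextInt(numTasks)
        a.take(cut) ++ b.drop(cut)
      }

      // Mutation: reassign one random task to a random executor.
      def mutate(c: Chromosome): Chromosome = {
        val copy = c.clone()
        copy(rand.nextInt(numTasks)) = rand.nextInt(numExecutors)
        copy
      }

      def main(args: Array[String]): Unit = {
        var pop = Array.fill(50)(Array.fill(numTasks)(rand.nextInt(numExecutors)))
        var tabu = List.empty[Int] // tabu list of recently seen solution hashes
        var best = pop.maxBy(fitness)

        for (_ <- 1 to 200) {
          val fits = pop.map(fitness)
          val (fAvg, fMax) = (fits.sum / fits.length, fits.max)
          pop = pop.map { c =>
            val f = fitness(c)
            var child =
              if (rand.nextDouble() < adaptiveRate(f, fAvg, fMax, 0.9, 0.9))
                crossover(c, pop(rand.nextInt(pop.length)))
              else c
            if (rand.nextDouble() < adaptiveRate(f, fAvg, fMax, 0.1, 0.1))
              child = mutate(child)
            // Tabu step: reject moves that revisit a recent solution.
            if (tabu.contains(child.toSeq.hashCode)) c else child
          }
          val gBest = pop.maxBy(fitness)
          if (fitness(gBest) > fitness(best)) best = gBest // elitist preservation
          tabu = (best.toSeq.hashCode :: tabu).take(20)
        }
        println(f"Best makespan: ${-fitness(best)}%.2f")
      }
    }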
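Similarly, the entropy-method weighting in contribution (2) can be sketched as follows: partitions are scored from assumed features (recomputation cost, predicted reference count, size), with feature weights derived from the entropy method, and the lowest-scoring partition is evicted first instead of the least recently used one. The PartitionStats fields and the scoring rule are assumptions for illustration; the thesis's actual weight model and update mechanism are not reproduced here.

    // Hedged sketch: entropy-weighted eviction score for cached RDD partitions.
    object EntropyWeightSketch {
      // Hypothetical per-partition features, not Spark's internal bookkeeping.
      case class PartitionStats(id: String, cost: Double, refs: Double, sizeMb: Double)

      // Column-normalize so each feature column sums to 1 (the p_ij matrix).
      def normalize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
        val n = rows.head.length
        val colSums = Array.tabulate(n)(j => rows.map(_(j)).sum)
        rows.map(r => Array.tabulate(n)(j => r(j) / colSums(j)))
      }

      // Entropy method: features that vary more across partitions carry more
      // information and therefore receive a larger weight.
      def entropyWeights(p: Seq[Array[Double]]): Array[Double] = {
        val m = p.length; val n = p.head.length
        val k = 1.0 / math.log(m)
        val entropy = Array.tabulate(n) { j =>
          -k * p.map(r => if (r(j) > 0) r(j) * math.log(r(j)) else 0.0).sum
        }
        val div = entropy.map(1.0 - _)   // degree of divergence per feature
        div.map(_ / div.sum)             // normalized weights
      }

      def main(args: Array[String]): Unit = {
        val cached = Seq(
          PartitionStats("rdd_3_p0", cost = 8.0, refs = 3, sizeMb = 64),
          PartitionStats("rdd_5_p1", cost = 2.0, refs = 1, sizeMb = 128),
          PartitionStats("rdd_7_p2", cost = 5.0, refs = 2, sizeMb = 32))
        // Larger size should lower the score, so invert it before weighting.
        val p = normalize(cached.map(s => Array(s.cost, s.refs, 1.0 / s.sizeMb)))
        val w = entropyWeights(p)
        val scores = cached.zip(p).map { case (s, r) =>
          s.id -> r.zip(w).map { case (v, wi) => v * wi }.sum
        }
        // Evict the lowest-scoring partition first, instead of the LRU one.
        scores.sortBy(_._2).foreach { case (id, sc) => println(f"$id%-10s $sc%.4f") }
      }
    }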
Keywords/Search Tags:Spark, RDD, genetic and tabu algorithm, weight updating mechanism, cache prediction mechanism