Research on Memory Data Management Technology in Spark

Posted on: 2021-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: C L Zhao
Full Text: PDF
GTID: 2518306107969279
Subject: Computer Science and Technology

Abstract/Summary:
Spark is an open-source big data processing platform. Its core abstraction is the Resilient Distributed Dataset (RDD), which is stored in a distributed fashion across the memory of the cluster to improve application execution efficiency. With the development of the big data era, data is growing explosively, and more and more enterprises use Spark to provide data processing services. In actual applications, however, when memory capacity becomes a bottleneck relative to the size of the data, the stability and efficiency of task execution fall far below those of MapReduce, and Spark can even crash; this behavior is closely related to Spark's memory data management. How to improve the memory resource utilization and task execution efficiency of a Spark cluster through memory cache data management techniques therefore has both research value and practical significance.

Spark's memory data management provides no automatic selection mechanism for cache objects, and when memory is insufficient it performs cache replacement with LRU, which ignores the characteristics of Spark data and degrades task execution efficiency. In addition, cache data can be shared only within a single application, so the same data cached by multiple applications is redundant. In view of these problems, this paper studies the management of memory cache data in a Spark cluster from the aspects of cache object selection, cache replacement, and cache data sharing. Its contributions include:

1. To address the uncertainty of cache object selection and the unreasonable choice of replacement objects, both of which increase task execution time, an adaptive caching mechanism for RDD memory data is proposed to optimize the RDD cache. The mechanism preferentially selects RDDs with high reusability and high computational cost as cache objects, and replaces LRU with a minimum-weight replacement algorithm. Taking the parallel computing characteristics of RDD partitions into account, the algorithm adds a full-reference-count influence factor to the weight calculation and combines the factors by linear weighted accumulation, making RDD partition weights more accurate and thus improving the accuracy of replacement object selection; the factor values are adjusted dynamically according to the task execution situation, so that cache replacement adapts to changes during execution. Controlled experiments show that the mechanism effectively reduces task execution time and improves Spark computing performance. A sketch of this replacement scheme follows.
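To make the weight calculation concrete, the Scala sketch below implements a minimum-weight replacement of the kind described above. The factor set, the coefficient values, and all names (PartitionInfo, selectVictims) are illustrative assumptions, not Spark internals or the thesis's exact formulation.

// Sketch of a minimum-weight cache replacement. PartitionInfo, the
// coefficients a1..a4, and selectVictims are hypothetical illustrations.
case class PartitionInfo(
  id: String,
  refCount: Int,        // remaining references to this partition in the DAG
  fullRefCount: Int,    // full reference count across parallel partitions
  computeCost: Double,  // estimated cost of recomputing the partition
  sizeBytes: Long       // memory footprint of the cached partition
)

object MinWeightReplacement {
  // Linear weighted accumulation of the influence factors; a higher
  // weight means the partition is more valuable to keep in cache.
  def weight(p: PartitionInfo,
             a1: Double = 0.4, a2: Double = 0.2,
             a3: Double = 0.3, a4: Double = 0.1): Double =
    a1 * p.refCount + a2 * p.fullRefCount + a3 * p.computeCost -
      a4 * (p.sizeBytes.toDouble / (1L << 20)) // size counts against keeping

  // Evict minimum-weight partitions until `needed` bytes are freed.
  def selectVictims(cached: Seq[PartitionInfo], needed: Long): Seq[PartitionInfo] = {
    val victims = scala.collection.mutable.ArrayBuffer.empty[PartitionInfo]
    var freed = 0L
    for (p <- cached.sortBy(q => weight(q)) if freed < needed) {
      victims += p
      freed += p.sizeBytes
    }
    victims.toSeq
  }
}

A caller would invoke selectVictims with the currently cached partitions and the number of bytes to free, then evict the returned partitions; LRU, by contrast, would consider only access recency.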
2. To address the problem that the same data is cached separately by different applications, causing cache redundancy and wasting memory resources, a sharing mechanism for RDD memory data is proposed, building on existing research into a multi-application shared memory data space. First, a master-slave memory data management architecture uniformly manages the cache data in the cluster, providing the information needed for data sharing. The memory data sharing system identifies RDDs that perform the same work in different applications and rewrites the DAG according to the current caching situation. To prevent the cache data required by a rewritten DAG from being evicted for lack of memory while the DAG waits to execute, the mechanism takes data references across multiple applications into account and improves the minimum-weight replacement algorithm with the entropy method (sketched below), ensuring the integrity of shared data. Experimental results show that this mechanism improves the utilization of the cluster's memory resources and effectively reduces job execution time.
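As a rough illustration of the entropy method mentioned in contribution 2, the Scala sketch below derives objective factor weights from observed partition statistics. The matrix layout (rows as cached partitions, columns as influence factors) and the names are assumptions for illustration, not the thesis's exact formulation.

// Sketch of the entropy-weight method for deriving factor weights.
object EntropyWeights {
  // x: rows = cached partitions, columns = factors (e.g. reference
  // count, compute cost, size). Values are assumed non-negative,
  // with at least two rows so that log(m) > 0.
  def weights(x: Array[Array[Double]]): Array[Double] = {
    val m = x.length       // number of partitions
    val n = x.head.length  // number of factors
    val colSums = Array.tabulate(n)(j => x.map(row => row(j)).sum)
    // Normalize each column into a probability distribution.
    val p = Array.tabulate(m, n) { (i, j) =>
      if (colSums(j) == 0) 0.0 else x(i)(j) / colSums(j)
    }
    val k = 1.0 / math.log(m.toDouble)
    // Entropy of each factor; low entropy (high divergence) means the
    // factor discriminates well between partitions.
    val e = Array.tabulate(n) { j =>
      -k * (0 until m).map { i =>
        val pij = p(i)(j)
        if (pij > 0) pij * math.log(pij) else 0.0
      }.sum
    }
    val d = e.map(1.0 - _) // degree of divergence per factor
    val dSum = d.sum
    d.map(dj => if (dSum == 0) 1.0 / n else dj / dSum)
  }
}

The resulting weights could then replace the fixed coefficients in the linear weighted accumulation above, so that replacement decisions reflect reference patterns across all applications sharing the cache.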
Judging from the accuracy of cache and replacement object selection, memory resource utilization, and job execution time, the research work in this paper is of significance for Spark big data processing.

Keywords: Spark, Parallel computing, Resilient Distributed Dataset, Self-adaptive, Cache, Data sharing