
Optimization Of RDD Cache Mechanism On Spark Framework

Posted on: 2022-01-02    Degree: Master    Type: Thesis
Country: China    Candidate: M D Yang    Full Text: PDF
GTID: 2518306575967069    Subject: Computer technology

Abstract/Summary:
Apache Spark can cache the result data of a Resilient Distributed Dataset (RDD) in memory or on disk through its persistence mechanism, avoiding the cost of repeatedly recomputing RDDs in complex applications, iterative jobs, and jobs over large data sets. Spark's cache management relies mainly on the Least Recently Used (LRU) algorithm, which does not consider the dependency relationships between RDDs. As a result, RDDs that are no longer depended on may occupy cache space, while RDDs that will be reused frequently in future work cannot stay in the cache. To reduce the recomputation cost incurred when an RDD is used again, this thesis optimizes Spark's cache management mechanism, proposing an eviction strategy based on cache value and an adaptive checkpoint mechanism. Specifically, the following work is completed.

First, to address the low cache hit rate of the LRU algorithm when RDDs are reused frequently over a long period, this thesis builds cache value models for RDDs and Blocks according to their dependency relationships, and proposes a cache-value-based eviction strategy, the Least Cache Value (LCV) strategy. The strategy adopts an active caching mechanism: when a Spark job is submitted, RDDs with high cache value are placed in the cache, and RDDs that are no longer needed are cleared during task execution. In addition, when a node's memory is insufficient to store the next cache block, the LCV strategy evicts blocks according to their cache value, so that blocks reused frequently in later work remain cached. Experimental results show that, compared with LRU and LRC, the proposed LCV strategy effectively improves Spark's cache hit rate and reduces task computation time.

Second, to further improve the performance of the LCV strategy, this thesis proposes an adaptive dynamic checkpointing (ADC) mechanism based on cache value. According to the length of the Spark lineage chain, the mechanism automatically decides whether to set checkpoints, and every certain number of stages it checkpoints RDDs with high cache value, thereby shortening the lineage chain, reducing the cost of RDD recomputation, and further releasing memory space. Experimental results show that the LCV strategy combined with the ADC mechanism reduces fault-tolerance cost to a certain extent, and is better suited to workloads with many iterations and sufficient memory resources.

In summary, the LCV strategy and ADC mechanism proposed in this thesis effectively improve the utilization of cache resources in the Spark framework, reduce RDD recomputation overhead and the user's burden of manual tuning, shorten task execution time, and ultimately improve the overall performance of the Spark framework.
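The value-based eviction described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the value formula (remaining references times recomputation cost, divided by block size), the `Block` fields, and the `LCVCache` class are all illustrative assumptions standing in for the cache value models the thesis derives from RDD dependencies.

```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: str
    size: int            # bytes occupied in the cache
    compute_cost: float  # estimated cost to recompute from lineage
    future_refs: int     # remaining references in the job DAG

def cache_value(b: Block) -> float:
    # Assumed value model: blocks that will be referenced often and are
    # expensive to recompute, but small, are the most worth keeping.
    return b.future_refs * b.compute_cost / b.size

class LCVCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.blocks: dict[str, Block] = {}

    def put(self, b: Block) -> None:
        if b.future_refs == 0:
            return  # no longer depended on: never cache
        # Evict least-cache-value blocks until the new block fits, but
        # never evict a resident block more valuable than the incoming one.
        while self.used + b.size > self.capacity and self.blocks:
            victim = min(self.blocks.values(), key=cache_value)
            if cache_value(victim) >= cache_value(b):
                return
            self.used -= victim.size
            del self.blocks[victim.block_id]
        if self.used + b.size <= self.capacity:
            self.blocks[b.block_id] = b
            self.used += b.size
```

Under this model, a large block with no remaining consumers is never admitted, and when space runs out the block with the lowest value-per-byte is evicted first, which is the behaviour the LCV strategy aims for in contrast to LRU's recency-only ordering.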
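The ADC decision can likewise be sketched as a small predicate. The thresholds, the evaluation interval, and the exact combination of lineage length and cache value below are illustrative assumptions, not values from the thesis; the sketch only shows the shape of the decision (re-evaluate periodically, checkpoint long-lineage, high-value RDDs).

```python
def should_checkpoint(lineage_length: int, value: float, stage_idx: int,
                      length_threshold: int = 10,
                      value_threshold: float = 1.0,
                      interval: int = 5) -> bool:
    """Hypothetical ADC decision: every `interval` stages, checkpoint
    an RDD whose lineage chain has grown long and whose cache value is
    high enough to justify the write cost."""
    if stage_idx == 0 or stage_idx % interval != 0:
        return False  # only re-evaluate periodically
    return lineage_length >= length_threshold and value >= value_threshold
```

In Spark itself, a selected RDD would then be checkpointed with `sc.setCheckpointDir(...)` followed by `rdd.checkpoint()`, which materializes the RDD to reliable storage and truncates its lineage, matching the mechanism's goal of shortening the lineage chain.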
Keywords/Search Tags: RDD, dependencies, cache value model, active cache, checkpoint mechanism