
Research On Intermediate Data Balance Placement And Adaptive Cache Replacement Strategy In Spark

Posted on: 2020-10-25    Degree: Master    Type: Thesis
Country: China    Candidate: X Q Gong    Full Text: PDF
GTID: 2428330623467007    Subject: Computer Science and Technology
Abstract/Summary:
Nowadays, as data size and data complexity keep increasing, Apache Spark is widely used in industry for its high-performance caching mechanism and high scalability. For data-intensive applications, however, the data placement characteristics of the shuffle stage can lead to problems such as uneven workload across the Spark cluster and useless intermediate results. How to place intermediate data reasonably in the shuffle stage and formulate a sound caching strategy has therefore become an urgent problem. Studying the intermediate data placement strategy of the shuffle stage and the cache replacement strategy in Spark is thus of great theoretical value and practical significance. To address these problems, this thesis covers the following three aspects.

(1) To improve the average execution time of Spark applications and the load balance of reduce tasks, a data placement method based on reservoir sampling in the shuffle stage is proposed, which tackles the low overall efficiency caused by uneven workload among reduce tasks in Spark. First, the method estimates the distribution of the input data by randomly sampling it at an appropriate rate with a reservoir-based sampling scheme and counting the number of tuples in each data set. Then, an index measuring the overall deviation of the input data set sizes is proposed to classify the input data into two levels: slight skew and severe skew. Finally, for slightly skewed input data, a coarse-grained data placement algorithm that does not split data sets is designed; it improves the overall efficiency of the system by sorting and scheduling the data sets. For severely skewed input data, a fine-grained data placement algorithm based on splitting data sets is designed; it achieves load balancing of the reduce tasks by splitting and merging data sets.
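As an illustration of the sampling step in (1), here is a minimal sketch (not taken from the thesis) of reservoir sampling over the keys of map-output records, together with a simple skew index defined as the largest estimated data set divided by the mean data set size; the record layout, the sample size k, and this particular skew measure are illustrative assumptions only.

import scala.util.Random
import scala.collection.mutable.ArrayBuffer

// Reservoir sampling keeps a uniform random sample of fixed size k from a stream
// of map-output records whose total length is not known in advance. Only the keys
// are kept, since the goal is to estimate how the tuples are distributed over keys.
object ReservoirSketch {
  def sampleKeys[K](records: Iterator[(K, Any)], k: Int, rng: Random): Vector[K] = {
    val reservoir = ArrayBuffer.empty[K]
    var seen = 0L
    for ((key, _) <- records) {
      seen += 1
      if (reservoir.size < k) {
        reservoir += key
      } else {
        // Keep the new key with probability k / seen, replacing a random slot;
        // this preserves a uniform sample over everything seen so far.
        val j = (rng.nextDouble() * seen).toLong
        if (j < k) reservoir(j.toInt) = key
      }
    }
    reservoir.toVector
  }

  // A simple skew index over the sampled keys: the largest estimated data set
  // divided by the mean data set size. Values near 1 suggest balanced shuffle
  // data; large values indicate severe skew.
  def skewIndex[K](sample: Seq[K]): Double = {
    val counts = sample.groupBy(identity).values.map(_.size.toDouble).toVector
    if (counts.isEmpty) 0.0 else counts.max / (counts.sum / counts.size)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // Synthetic skewed stream: key 0 is "hot", the remaining keys are uniform.
    val stream = Iterator.fill(100000)((if (rng.nextDouble() < 0.3) 0 else 1 + rng.nextInt(99), ()))
    val sample = sampleKeys(stream, 1000, rng)
    println(f"skew index = ${skewIndex(sample)}%.2f")
  }
}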
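The coarse- and fine-grained placement in (1) can likewise be pictured with a simplified greedy scheme: data sets are sorted by estimated size and assigned to the currently least-loaded reduce task, and under severe skew any data set above a threshold is first split into fragments. This is only a sketch under those assumptions, not the thesis's exact algorithms.

// Simplified, hypothetical placement of data sets (key groups) onto reduce tasks.
// Coarse-grained idea: sort data sets by size and always give the next one to the
// least loaded reducer. Fine-grained idea: first split any data set above a
// threshold, so a single huge key group cannot overload one reducer.
object GreedyPlacement {
  // dataSets: estimated size (tuple count) per data set id.
  def place(dataSets: Map[String, Long], numReducers: Int, splitThreshold: Long): Map[Int, Long] = {
    // Fine-grained step: split oversized data sets into fragments below the threshold
    // (remainders are ignored here for brevity).
    val fragments: Seq[Long] = dataSets.values.toSeq.flatMap { size =>
      if (size <= splitThreshold) Seq(size)
      else {
        val parts = ((size + splitThreshold - 1) / splitThreshold).toInt
        Seq.fill(parts)(size / parts)
      }
    }
    // Coarse-grained step: largest fragment first, to the currently least loaded reducer.
    val load = Array.fill(numReducers)(0L)
    for (size <- fragments.sorted(Ordering[Long].reverse)) {
      val target = load.indices.minBy(i => load(i))
      load(target) += size
    }
    load.zipWithIndex.map { case (l, r) => r -> l }.toMap
  }

  def main(args: Array[String]): Unit = {
    val sizes = Map("a" -> 90000L, "b" -> 12000L, "c" -> 11000L, "d" -> 10000L)
    place(sizes, numReducers = 4, splitThreshold = 30000L).toSeq.sorted.foreach {
      case (r, l) => println(s"reducer $r -> $l tuples")
    }
  }
}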
(2) To improve the average execution time of Spark applications, system memory utilization, and the cache hit rate, an adaptive cache replacement algorithm based on maximizing cache gain is proposed, which tackles the inefficiency and resource waste caused by useless intermediate results in Spark. First, the dependencies among operations in the DAG are analyzed and a cache gain model is proposed to quantify the benefit of caching, with the objective of maximizing the total cache gain. Then, when the job arrival rate is known, a rounded approximately optimal solution under knapsack constraints is obtained with the Pipage rounding method. Finally, when the job arrival rate is unknown, projected gradient ascent is used to obtain the probability with which each candidate node should be placed in the cache, yielding an online adaptive cache replacement strategy that maximizes the cache gain.
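To make the cache gain idea in (2) concrete, the following sketch relies on illustrative assumptions: hypothetical candidate RDD names, a gain defined as recomputation cost times expected reuse count, and a plain greedy selection under a memory budget. The thesis itself builds a cache gain model over the DAG and solves it with Pipage rounding and projected gradient ascent, which this simplified heuristic does not reproduce.

// Hypothetical illustration of "cache gain": each candidate intermediate result (RDD)
// saves (recomputation cost x expected future reuses) if it is kept in memory, at the
// price of its size. A greedy pick by gain per MB under a memory budget stands in for
// the knapsack-constrained optimization described in the thesis.
final case class Candidate(id: String, sizeMB: Double, recomputeSec: Double, expectedReuses: Int) {
  def gain: Double = recomputeSec * expectedReuses      // total recomputation time saved
  def density: Double = if (sizeMB > 0) gain / sizeMB else 0.0
}

object GreedyCachePlan {
  def select(candidates: Seq[Candidate], budgetMB: Double): Seq[Candidate] = {
    var remaining = budgetMB
    val chosen = scala.collection.mutable.ListBuffer.empty[Candidate]
    // Highest gain per MB first, skipping anything that no longer fits in the budget.
    for (c <- candidates.sortBy(-_.density) if c.sizeMB <= remaining) {
      chosen += c
      remaining -= c.sizeMB
    }
    chosen.toList
  }

  def main(args: Array[String]): Unit = {
    val cands = Seq(
      Candidate("shuffled_join", sizeMB = 800, recomputeSec = 120, expectedReuses = 3),
      Candidate("parsed_logs", sizeMB = 1500, recomputeSec = 40, expectedReuses = 5),
      Candidate("small_lookup", sizeMB = 50, recomputeSec = 15, expectedReuses = 10)
    )
    select(cands, budgetMB = 2000).foreach(c => println(s"cache ${c.id} (gain ${c.gain} s)"))
  }
}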
(3) The performance of the proposed algorithms is evaluated against existing algorithms. In the experiments on the reservoir-sampling-based data placement strategy, the results are as follows. When the input data is slightly skewed, the coarse-grained data placement algorithm proposed in this thesis shows clear advantages in average execution time over the DEFH and RANGE algorithms, and the load balancing degree of the reduce tasks is also greatly improved. When the input data is severely skewed, the fine-grained data placement algorithm proposed in this thesis greatly improves the average execution time compared with the SCID and RANGE algorithms, and significantly improves the load balancing degree of the reduce tasks compared with the SCID and DEFH algorithms. In the experiments on the adaptive cache replacement strategy based on maximizing cache gain, the proposed cache replacement algorithm is compared with the LRU, LCS, and LRC algorithms. The results show that it has clear advantages in average execution time over the LCS and LRU algorithms, and also significantly outperforms the LRC algorithm in memory utilization and cache hit ratio.

Keywords/Search Tags: data shuffling phase, data set, intermediate data placement, cache gain, adaptive cache replacement