
Research And Implementation Of Balanced Partition Method Based On Spark Computing Granularity Adjustment

Posted on: 2022-02-05    Degree: Master    Type: Thesis
Country: China    Candidate: X Liu    Full Text: PDF
GTID: 2518306731987859    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, a wide variety of Internet products have entered people's lives, and the accumulation of user behavior has produced massive data. At the same time, the scale and structural characteristics of this data pose new challenges for data processing. As a fast, general-purpose distributed computing engine, Spark is widely used in big data processing, and its in-memory computing model improves system performance. However, the Shuffle process still suffers from low resource utilization and skewed partition data. Targeting these shortcomings of the Spark framework, this thesis studies optimization techniques for the Spark Shuffle process.

To address Shuffle performance in distributed data processing frameworks, this thesis proposes DAGP (Dynamic Adjustment of Granularity and Partitioning), a Shuffle optimization strategy that jointly considers resource utilization and balanced partitioning. It consists of three parts: intermediate data sampling, computing granularity adjustment, and a balanced partition strategy.

First, an importance-based sampling algorithm is proposed. When the sampling step is generated, an importance parameter is attached to keys subject to sampling bias, making such steps more likely to be accepted. Based on the sampling results, keys with high frequency are defined as high-weight keys.

Second, a computing granularity adjustment algorithm is proposed. According to the number of key-value pairs in the sampled data and the available cluster resources, the computing granularity is adjusted by changing the number of partitions in the current stage, reducing the likelihood of idle CPUs and improving cluster resource utilization.

Finally, by analyzing the sampled data, the distribution of the intermediate data is predicted and keys of different weights are distinguished. A balanced partition strategy is proposed, comprising HWKP for high-weight keys and LWKP for low-weight keys. Based on the ideas of weighted Round-Robin and efficient hashing, the strategy repartitions the Shuffle data, effectively alleviating data skew and achieving load balancing.

To reuse the task scheduling and memory management mechanisms of an existing distributed computing framework, DAGP is integrated into Spark. This implementation verifies the effectiveness of the importance sampling and computing granularity adjustment algorithms. Three widely used benchmarks, WordCount, Join, and PageRank, are used to evaluate the performance and execution time of DAGP. The experimental results show that the strategy effectively alleviates data skew in large-scale computation, reducing skew between partitions by 25% and shortening the overall application processing time by 30% on the benchmarks.
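To make the partitioning idea concrete, the sketch below shows a skew-aware Spark Partitioner in the spirit of the HWKP/LWKP strategy described above: keys identified as high-weight by sampling are scattered across partitions in round-robin fashion, while low-weight keys fall back to ordinary hash partitioning. This is a minimal illustration, not the thesis implementation; the class name, the simple per-executor counter, and the way high-weight keys are supplied as a Set are all assumptions made for the example.

```scala
import org.apache.spark.Partitioner
import java.util.concurrent.atomic.AtomicLong

// Hypothetical skew-aware partitioner (illustrative only).
// High-weight keys (found by sampling) are spread round-robin over all
// partitions; low-weight keys use standard non-negative hash partitioning.
class SkewAwarePartitioner(numParts: Int, highWeightKeys: Set[String])
    extends Partitioner with Serializable {

  // Rotating counter used to scatter records of high-weight keys.
  // Each executor holds its own copy after serialization, which is
  // acceptable for a sketch whose goal is simply to balance load.
  private val rr = new AtomicLong(0L)

  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val k = key.toString
    if (highWeightKeys.contains(k)) {
      // High-weight key: round-robin to balance partition sizes.
      (rr.getAndIncrement() % numParts).toInt
    } else {
      // Low-weight key: ordinary hash partitioning.
      val h = k.hashCode % numParts
      if (h < 0) h + numParts else h
    }
  }
}

// Usage sketch: repartition a key-value RDD before a shuffle-heavy stage.
// val balanced = pairRdd.partitionBy(new SkewAwarePartitioner(200, hotKeys))
```

Note that scattering records of the same key across partitions breaks key-grouped semantics for operations such as reduceByKey, so in practice it must be paired with a subsequent combine step; the thesis instead integrates its strategy directly into Spark's Shuffle machinery rather than exposing it as a user-level Partitioner.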
Keywords/Search Tags:data sampling, data skew, data partitioning, distributed computing, granularity adjustment