
Research And Implementation Of Balanced Partition Method Based On Spark Computing Granularity Adjustment

Posted on: 2022-02-05    Degree: Master    Type: Thesis
Country: China    Candidate: X Liu    Full Text: PDF
GTID: 2518306731987859    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, a wide variety of Internet products have entered people's lives, and the accumulation of user behavior has produced massive data. At the same time, the scale and structural characteristics of this data pose new challenges for data processing. As a fast, general-purpose distributed computing engine, Spark is widely used in big data processing, and its in-memory computing model improves system performance. However, the Shuffle process still suffers from low resource utilization and skewed partition data. Targeting these shortcomings of the Spark framework, this thesis studies optimization techniques for the Spark Shuffle process.

To address Shuffle performance in distributed data processing frameworks, this thesis proposes DAGP (Dynamic Adjustment of Granularity and Partitioning), a Shuffle optimization strategy that jointly considers resource utilization and balanced partitioning. It consists of three parts: intermediate data sampling, computing granularity adjustment, and a balanced partition strategy.

First, an importance-based sampling algorithm is proposed. When the sampling step is generated, an importance parameter is attached to keys subject to sampling bias, making such steps more likely to be accepted. Based on the sampling results, keys with high frequency are defined as high-weight keys.

Second, a computing granularity adjustment algorithm is proposed. According to the number of key-value pairs in the sampled data and the available cluster resources, the computing granularity is adjusted by changing the number of partitions in the current stage, reducing the likelihood of idle CPUs and improving cluster resource utilization.

Finally, by analyzing the sampled data, the distribution of the intermediate data is predicted and keys of different weights are distinguished. A balanced partition strategy is proposed, comprising HWKP for high-weight keys and LWKP for low-weight keys. Based on the ideas of weighted Round-Robin and efficient hashing, the strategy repartitions the Shuffle data, effectively alleviating data skew and achieving load balancing.

To reuse the task scheduling and memory management mechanisms of an existing distributed computing framework, DAGP is integrated into Spark. This implementation verifies the effectiveness of the importance sampling and computing granularity adjustment algorithms. Three widely used benchmarks, WordCount, Join, and PageRank, are used to evaluate the performance and execution time of DAGP. The experimental results show that the strategy effectively alleviates data skew in large-scale computation, reducing skew between partitions by 25% and shortening the overall application processing time by 30% on the benchmarks.
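To make the partitioning idea concrete, the sketch below shows a skew-aware Spark Partitioner in the spirit of the HWKP/LWKP strategy described above: keys identified as high-weight by sampling are scattered across partitions in round-robin fashion, while low-weight keys fall back to ordinary hash partitioning. This is a minimal illustration, not the thesis implementation; the class name, the simple per-executor counter, and the way high-weight keys are supplied as a Set are all assumptions made for the example.

```scala
import org.apache.spark.Partitioner
import java.util.concurrent.atomic.AtomicLong

// Hypothetical skew-aware partitioner (illustrative only).
// High-weight keys (found by sampling) are spread round-robin over all
// partitions; low-weight keys use standard non-negative hash partitioning.
class SkewAwarePartitioner(numParts: Int, highWeightKeys: Set[String])
    extends Partitioner with Serializable {

  // Rotating counter used to scatter records of high-weight keys.
  // Each executor holds its own copy after serialization, which is
  // acceptable for a sketch whose goal is simply to balance load.
  private val rr = new AtomicLong(0L)

  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val k = key.toString
    if (highWeightKeys.contains(k)) {
      // High-weight key: round-robin to balance partition sizes.
      (rr.getAndIncrement() % numParts).toInt
    } else {
      // Low-weight key: ordinary hash partitioning.
      val h = k.hashCode % numParts
      if (h < 0) h + numParts else h
    }
  }
}

// Usage sketch: repartition a key-value RDD before a shuffle-heavy stage.
// val balanced = pairRdd.partitionBy(new SkewAwarePartitioner(200, hotKeys))
```

Note that scattering records of the same key across partitions breaks key-grouped semantics for operations such as reduceByKey, so in practice it must be paired with a subsequent combine step; the thesis instead integrates its strategy directly into Spark's Shuffle machinery rather than exposing it as a user-level Partitioner.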
Keywords/Search Tags:data sampling, data skew, data partitioning, distributed computing, granularity adjustment