
OPS: Optimized Shuffle Management for Distributed Computing Frameworks

Posted on: 2021-04-29    Degree: Master    Type: Thesis
Country: China    Candidate: Z X Wu    Full Text: PDF
GTID: 2518306503474144    Subject: Software engineering
Abstract/Summary:
In recent years, distributed computing frameworks such as Hadoop MapReduce and Spark have been widely used for big data processing. With the explosive growth of data volume, companies tend to store the intermediate data of the shuffle stage on disk rather than in memory, so the shuffle stage involves intensive network and disk I/O. In addition, to improve parallelism and reduce straggler tasks, both academia and industry recommend splitting a computing job into a large number of small tasks. However, our research found that as the number of small tasks increases, the number of disk and network I/O requests grows quadratically, which further exacerbates the performance overhead of the shuffle stage.

To evaluate the performance of distributed computing frameworks under different hardware environments and resource scheduling strategies, we present the Framework Resource Quantification (FRQ) model. The FRQ model can predict the execution time of distributed computing jobs and evaluate the performance of the resource scheduling strategies they employ.

To reduce the overhead of the shuffle stage, we propose OPS, an open-source distributed shuffle service built on Spark. OPS provides an independent shuffle service for Spark. Using pre-merge and pre-shuffle strategies, OPS alleviates the I/O overhead of the shuffle stage and efficiently schedules I/O and computing resources. OPS also introduces a slot-based scheduling algorithm to predict and compute the optimal scheduling of reduce tasks, and a taint-redo strategy to ensure the fault tolerance of computing jobs. We evaluate the performance of OPS on a 100-node AWS EC2 cluster. Overall, OPS reduces shuffle overhead by nearly 50%, and on the HiBench test cases it improves end-to-end completion time by nearly 30% on average.
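The quadratic growth claimed above can be sketched with a small back-of-the-envelope calculation. In a MapReduce-style shuffle, every reduce task fetches one block from every map task, so a job with m map tasks and r reduce tasks issues m × r fetch requests; splitting the same job into k times as many tasks on each side multiplies the request count by k². The `premerged_requests` function below is our illustrative reading of why a pre-merge strategy helps (consolidating each node's map outputs per reducer), not the thesis's actual implementation; the numbers are hypothetical.

```python
def shuffle_requests(map_tasks: int, reduce_tasks: int) -> int:
    """Each reducer pulls one shuffle block from each mapper."""
    return map_tasks * reduce_tasks

def premerged_requests(nodes: int, reduce_tasks: int) -> int:
    """Illustrative assumption: if each node pre-merges its local map
    outputs per reducer, a reducer fetches one merged block per node
    instead of one per map task."""
    return nodes * reduce_tasks

# Splitting a job into 10x as many (smaller) tasks on both sides
# multiplies the fetch-request count by 10^2 = 100.
base = shuffle_requests(100, 100)      # 10_000 requests
split = shuffle_requests(1000, 1000)   # 1_000_000 requests
assert split == base * 10 ** 2

# With per-node pre-merging on a 100-node cluster, the split job's
# request count no longer depends on the number of map tasks.
merged = premerged_requests(100, 1000)  # 100_000 requests
assert merged < split
```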
Keywords/Search Tags: Distributed computing, Shuffle, Performance optimization