Research On Resource-aware Skew Mitigation For Mapreduce

Posted on: 2017-11-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z H Liu
GTID: 1368330569998434
Subject: Computer Science and Technology
Abstract/Summary:
With the development of cloud computing, the Internet of Things and social networking, the volume of data grows continuously. Analysing these large datasets efficiently and promptly, so as to extract relevant information for decision making, is becoming an issue of vital importance. MapReduce, a large-scale data processing framework, has gained much popularity. By breaking down a data-intensive job into a number of small tasks and executing them in parallel across multiple machines, MapReduce can significantly reduce the job running time. However, since the characteristics of the input datasets are unknown before the job executes, it is difficult to balance the load when assigning it among tasks. As a result, heavily loaded tasks run slowly, thereby prolonging the completion time of the job. Here, we refer to the uneven distribution of load among tasks as data skew.

Many solutions for mitigating data skew have been proposed recently. However, these solutions still have some limitations. First, run-time-efficient techniques for estimating the reducers' workload are still scarce. Second, existing solutions follow a similar pattern that repartitions the load among reducers. This breaks the primitive of MapReduce and therefore incurs heavy overhead. Third, existing solutions focus on partitioning skew but cannot mitigate computational skew. Therefore, in this dissertation, we focus on the data skew problem in MapReduce and propose a series of skew mitigation techniques based on the concept of dynamic resource allocation. The contributions of this dissertation are as follows.

To begin with, we propose two run-time reducer workload estimation approaches: one based on linear regression and one based on re-sampling. The former approach assumes that the size of a generated sub-partition is linearly related to the number of completed mappers. The latter approach leverages the trimmed mean and re-sampling techniques to improve robustness. Thus, even if the assumption of the former approach does not hold, the latter approach can still produce accurate estimates. However, since the latter approach increases the sample size and needs to verify the estimation accuracy iteratively, it has higher overhead. The experimental results show that the worst-case prediction errors of these two approaches are no more than 17.5% and 10.92%, respectively, when only 5% of mappers have completed.

Secondly, by using dynamic resource allocation, we propose DREAMS, a job-profile-based partitioning skew mitigation approach. Instead of repartitioning the load among reducers, DREAMS dynamically allocates resources to reducers based on their load. More specifically, in order to determine the resource requirement of each reducer, we investigate the impact of task load and resource allocation on task execution time and propose a reduce task performance model. We refer to the task performance models for different applications as the job profiles of these applications. DREAMS allocates resources based on the job profiles and the predicted task load, which can accelerate the skewed tasks, thereby mitigating the partitioning skew. The experimental results show that DREAMS can effectively mitigate partitioning skew and significantly improve the job completion time. More specifically, DREAMS improves the job completion time by up to a factor of 2.29 over native Hadoop YARN. Compared to the state-of-the-art solution, DREAMS can improve the job completion time by a factor of 1.65.

Thirdly, most existing solutions are based on offline heuristics and cannot handle partitioning skew online. To this end, we propose OPTIMA, an outlier-detection-based online partitioning skew mitigation approach. OPTIMA abandons job profiles and mitigates partitioning skew in an online manner. As a result, OPTIMA is applicable not only to routine MapReduce applications but also to applications that have not been executed before. The experimental results show that OPTIMA can effectively mitigate partitioning skew and improve the job completion time by up to 32.58%.

Lastly, since existing solutions neglect computational skew and can only mitigate partitioning skew, we propose DynamicAdjust, a technique that can mitigate both partitioning and computational skew. DynamicAdjust detects skewed tasks by monitoring the remaining time of each task rather than predicting the task load. As a result, DynamicAdjust can detect both partitioning and computational skew. In addition, DynamicAdjust adjusts the resources of containers while they are running. The experimental results show that DynamicAdjust can improve the skew detection accuracy by up to 47.64%. Moreover, DynamicAdjust can effectively mitigate both partitioning and computational skew. More specifically, DynamicAdjust improves the job completion time by up to 40.85% in comparison to native Hadoop YARN.

Overall, this dissertation focuses on the data skew problem in MapReduce. By using the concept of dynamic resource allocation, we propose a series of skew mitigation approaches from different perspectives. The evaluation results confirm that our approaches can eliminate the negative impact of data skew, reduce the job completion time and save the cost of data processing.
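
The linear-regression-based estimation mentioned in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function name `estimate_reducer_load`, its parameters, and the use of ordinary least squares over (completed mappers, observed partition size) pairs are assumptions made only for illustration. The sketch relies on the stated assumption that a reducer's partition size grows linearly with the number of completed mappers, and extrapolates that line to the total number of mappers.

```python
from typing import Sequence


def estimate_reducer_load(
    completed: Sequence[int],
    sizes: Sequence[float],
    total_mappers: int,
) -> float:
    """Extrapolate a reducer's final input size (hypothetical helper).

    completed[i] is the number of mappers finished at observation i and
    sizes[i] is the partition size seen for this reducer at that moment.
    A least-squares line is fitted and evaluated at total_mappers.
    """
    n = len(completed)
    if n != len(sizes) or n < 2:
        raise ValueError("need at least two (completed, size) observations")

    mean_x = sum(completed) / n
    mean_y = sum(sizes) / n

    # Sum of squared deviations of the mapper counts.
    sxx = sum((x - mean_x) ** 2 for x in completed)
    if sxx == 0:
        raise ValueError("observations must cover two distinct mapper counts")

    # Covariance term between mapper counts and observed sizes.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(completed, sizes))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * total_mappers + intercept


# Illustrative data only: five observations taken while 5 of 100 mappers ran.
estimate = estimate_reducer_load([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 100)
```

If the linearity assumption fails (for example, when key frequencies differ sharply between input splits), such an extrapolation can be far off, which is the situation the abstract's re-sampling-based approach is meant to handle.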
Keywords: MapReduce, Hadoop, resource-aware scheduling, data skew, straggler, workload prediction, task performance model