In recent decades, more data has been produced than in all prior human history; this explosion of data is what we call the era of big data. As a key foundation of big data applications, extreme-scale data processing systems have never been in such urgent demand as they are today. Google's MapReduce framework has proven to be a simple yet effective programming model for processing large data sets in a massively parallel manner: MapReduce breaks a data processing job into many small tasks and runs them in parallel on different nodes of a cluster. Consequently, the scheduling of these tasks, on which the performance of a data processing system largely depends, has attracted extensive attention in recent years.

Existing research on MapReduce mainly focuses on topics such as task division algorithms, fault tolerance mechanisms, job execution time prediction, and scheduling strategies. However, as system scale increases, traditional task division and scheduling methods can no longer meet the demands of big data processing. On the one hand, the fault tolerance of such systems requires more flexible and reliable treatment; on the other, the intrinsic characteristics of big data turn task division and scheduling into a novel problem within the MapReduce framework. As a consequence, several problems emerge when running big data applications under the current MapReduce framework. First, most work on the current MapReduce framework uses a model-based method to predict task execution time, which then serves as an input for task scheduling; in a large-scale system, however, a complex prediction model implies a high computational cost, while a simple one turns out to be inadequate because of its low accuracy. Second, existing scheduling algorithms rely on precise task execution times as input, whereas in real applications the execution time of a task is always accompanied by uncertainty; worse, this uncertainty grows with system scale and complexity, and eventually becomes a bottleneck for system performance. Third, current task division and scheduling methods do not account for the impact of data characteristics on task execution time, even though in real applications features of the data (e.g., data skew) can distribute the workload unevenly among tasks and slow down the overall response time. Lastly, data locality, if not handled well, makes Reduce task scheduling ineffective, so an increase in network traffic and in the complexity of data transfer is unavoidable.

Based on the above observations, this thesis systematically investigates partitioning and scheduling in big data processing systems, focusing on the following topics.

Based on the observation that model-based prediction of task execution time is inappropriate for big data processing systems, this thesis analyzes the characteristics of the MapReduce framework and presents a new method, Risk I. Risk I is inspired by two observations. First, in an extreme-scale data processing system, jobs are executed periodically, so the execution history of a task can be collected to predict its future execution time. Second, since uncertainty is an intrinsic feature of task execution time, risk-management theory is introduced into the assignment of tasks. In this way, system throughput is maximized while the performance loss caused by uncertainty is kept bounded, with a guarantee from the risk-management point of view.
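To illustrate the flavor of such a history-based, risk-aware assignment, the following is a minimal sketch. The value-at-risk-style premium (historical mean plus z standard deviations), the greedy placement, and all names are illustrative assumptions, not the actual Risk I algorithm.

```python
from statistics import mean, stdev

def risk_adjusted_runtime(history, z=1.65):
    """History-based runtime estimate with a risk premium.

    Rather than trusting a point prediction, add z standard deviations
    to the historical mean (z = 1.65 is roughly a 95% one-sided bound
    under normality), so that tasks with volatile runtimes are
    scheduled conservatively.
    """
    if len(history) < 2:
        return history[0] if history else 0.0
    return mean(history) + z * stdev(history)

def assign(task_histories, nodes):
    """Greedy assignment: longest (risk-adjusted) tasks first, each
    placed on the node that currently finishes earliest."""
    finish = {n: 0.0 for n in nodes}
    plan = {}
    for task in sorted(task_histories,
                       key=lambda t: risk_adjusted_runtime(task_histories[t]),
                       reverse=True):
        node = min(finish, key=finish.get)
        plan[task] = node
        finish[node] += risk_adjusted_runtime(task_histories[task])
    return plan

# Hypothetical usage: runtimes (seconds) observed over past periodic runs.
plan = assign({"map-1": [10.2, 11.0, 9.8], "map-2": [30.5, 44.0, 38.2]},
              nodes=["node-a", "node-b"])
```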
To address the straggler problem caused by data skew, this thesis proposes a novel speculative execution mechanism, Skew Seize, which automatically identifies the cause of a straggler and chooses between moving the straggler and moving non-stragglers. When a Reduce straggler is found, its input size is first compared with that of the other Reduce tasks in the job; the input size, which reflects uneven workload, is collected from the network traffic while the intermediate values are shuffled. This step distinguishes skewed tasks from ordinary stragglers. On top of this, a more flexible speculative execution algorithm decides between moving the straggler and moving non-stragglers: if the straggler is hosted on a slow node, it is moved to another node; if, on the other hand, the straggler is caused by data skew, the non-stragglers are moved to other nodes instead (a sketch of this decision rule is given after the summary). The effectiveness of this speculative execution mechanism is verified on a private cluster, where it is shown to be more efficient than conventional methods.

Towards an effective and efficient skew-free scheme, this thesis proposes a closed-loop mechanism called Skew Control for automatic task scheduling and partitioning. Specifically, empirical studies of the data skew phenomenon are conducted on three real-life data traces: a road-traffic record, a cell phone call history, and a record of metropolitan transportation smart cards. The harmful effect of data skew on system response time is then quantified and analyzed. Empirical analysis of these real applications exhibits several stable and predictable patterns in data skew, on which the Skew Control strategy is built from three major components: a data skew shaping and prediction algorithm based on historical information, a capacity-aware task partitioning and scheduling algorithm, and a feedback-control mechanism for scheduling refinement. In this way, even in a changing environment, Skew Control is able to refine the task partition and keep the scheduling performance optimized.

To deal with the degradation of system performance caused by poor data locality, this thesis first investigates how different Reduce task scheduling mechanisms change the data transfer at the Shuffle stage, and extracts the resulting inter- and intra-cabinet data transfer patterns. A task assignment scheme named Jinking is then proposed to schedule Reduce tasks in the MapReduce framework so as to avoid network congestion in a proactive manner. Specifically, instead of minimizing the total traffic cost, Jinking schedules Reduce tasks so as to minimize the maximum traffic cost of each shuffle (a sketch of this min-max placement is also given after the summary). In addition, when the temporal data is (partially) unknown, delay scheduling and immediate optimal scheduling are used to decrease data transfer in the network and avoid congestion in an active way.

To sum up, this thesis presents solutions to several essential issues in task partitioning and scheduling for big data processing systems. Comprehensive experiments demonstrate the effectiveness of the proposed methods and show their theoretical and practical value for research on large-scale data processing systems.
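To make the straggler-classification rule behind Skew Seize concrete, here is a minimal sketch. The median-based comparison, the skew threshold, and all names are illustrative assumptions rather than the thesis's actual implementation.

```python
from statistics import median

def classify_and_act(straggler, peer_input_sizes, skew_factor=2.0):
    """Sketch of a Skew-Seize-style decision.

    straggler: dict with the slow task's 'input_bytes' and 'node'.
    peer_input_sizes: input sizes (bytes) of the other Reduce tasks
        in the same job, observed from shuffle traffic.
    Returns which side to move: the straggler itself (slow node) or
    the non-stragglers co-located with it (data skew).
    """
    typical = median(peer_input_sizes)
    if straggler["input_bytes"] > skew_factor * typical:
        # The task is slow because its partition is oversized:
        # relocating it would not help, so move co-located
        # non-stragglers away to free resources on its node.
        return ("move_non_stragglers_off", straggler["node"])
    # Otherwise the node itself is slow: re-execute the straggler
    # elsewhere, as classic speculative execution does.
    return ("move_straggler_to_new_node", straggler["node"])

# Hypothetical usage: a 9 GB partition among ~2 GB peers -> data skew.
action, node = classify_and_act({"input_bytes": 9e9, "node": "n7"},
                                [2e9, 2.5e9, 1.8e9])
```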
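To illustrate the min-max objective that distinguishes Jinking from total-cost schedulers, the following greedy sketch places each Reduce task so that the largest per-link shuffle load stays small. The one-number-per-task traffic model, the longest-task-first heuristic, and all names are simplifying assumptions, not the thesis's actual algorithm.

```python
def min_max_placement(task_traffic, links):
    """Assign Reduce tasks to cabinet uplinks so that the maximum
    per-link shuffle traffic, not the total, is kept small greedily.

    task_traffic: Reduce task -> shuffle bytes it will pull.
    links: identifiers of the cabinet uplinks carrying the traffic.
    """
    load = {link: 0.0 for link in links}
    placement = {}
    # Place heavy tasks first so they cannot be stranded on an
    # already-loaded link at the end (LPT-style heuristic).
    for task in sorted(task_traffic, key=task_traffic.get, reverse=True):
        best = min(load, key=load.get)  # least-loaded link so far
        placement[task] = best
        load[best] += task_traffic[task]
    return placement, max(load.values())

# Hypothetical usage: three reducers across two cabinet uplinks.
plan, worst = min_max_placement({"r1": 5e9, "r2": 4e9, "r3": 2e9},
                                ["cab-A", "cab-B"])
```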