Research On Scheduling Techniques In Resource Utilization Constrained Big Data Processing Systems

Posted on: 2023-06-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S J Wu
GTID: 1528307172453264
Subject: Computer system architecture
Abstract/Summary:
Efficient big data processing is a key supporting technology for extracting value from big data. Because the resources in a big data processing system are limited, making full use of the available resources through efficient scheduling is the basis for improving system performance. Scheduling in big data processing mainly comprises job scheduling and data partitioning: job scheduling assigns computing tasks to computing resources, and data partitioning distributes data to different tasks during job processing. Existing big data processing systems greatly reduce disk I/O overhead through in-memory computing, so system performance is now mainly limited by whether computing and network resources are fully utilized.

However, current big data processing systems suffer from limited utilization of computing and network resources in three respects. First, the allocation of available computing resources in the cluster is limited. Because the computing stages of a job have complex dependencies, the system cannot schedule a stage until the stages it depends on have completed, so the available computing resources in the cluster are difficult to allocate fully to jobs. Second, the utilization of computing resources on tasks is limited. Because data in real systems is skewed, the existing hash-based data partitioning places uneven load on different tasks in the same computing stage. Tasks with lower load finish sooner, and once they finish they have no data left to process; their idle computing resources can neither be used by the task itself nor released to other tasks in the same stage. Third, the utilization of network resources on transmission links is limited. Existing systems reduce the amount of data transmitted in the input stage by improving the data locality of the tasks that process input data. As a result, these tasks are unevenly distributed across racks, which in turn leads to uneven load on different network links during the phase that transfers intermediate data. Transmission on lightly loaded links completes sooner, and the network resources freed afterwards stay idle because those links have no more data to transfer.

In response to these problems, this dissertation studies scheduling techniques in big data processing systems around the core goal of "efficient resource utilization". The work covers the following three aspects.

To address the limited allocation of available computing resources in the cluster caused by task dependencies, a dependency-aware DAG job scheduling technique called Argus is proposed, which improves computing resource utilization during job scheduling. First, monitoring of the job execution process shows that inter-stage dependencies leave the system with idle resources while computing stages cannot yet be scheduled. The analysis reveals that the scheduling order among stages that are scheduled in parallel affects resource utilization. Then, the idea of determining the scheduling order of stages from the dependencies between them is proposed. By making full use of the known job DAG structure, a heuristic algorithm determines the priority among multiple stages scheduled in parallel, which improves the utilization of computing resources. Finally, Argus is implemented on top of Apache Spark, and comprehensive experiments evaluate its performance using large-scale traces collected from real-world systems. Results show that, compared to state-of-the-art designs, Argus reduces the job completion time and makespan by 38% and 31%, respectively.

To address the limited utilization of computing resources caused by skewed loads on different tasks, a data skewness-aware differentiated data partitioning mechanism called Astraea is proposed, which improves computing resource utilization during task execution. First, an analysis of data distribution characteristics reveals that the hash-based data partitioning strategy leads to load skewness across tasks, so that some tasks hold idle computing resources but have no data left to process. Then, using the idea of pipeline parallelism, the principle of determining the partition strategy from data frequency information is proposed. To avoid the overhead of globally collecting statistics on intermediate data, a differentiated data partitioning method is designed. For the small number of high-frequency keys, Astraea uses the input data statistics, which are available in the batching phase; for the remaining large fraction of low-frequency keys, Astraea uses accurate local intermediate data statistics. Finally, Astraea is implemented on top of Spark Streaming and evaluated with multiple large-scale real-world datasets. Results show that Astraea reduces the degree of load skewness by 42% and improves system throughput by 27% compared to the state-of-the-art design.

To address the limited utilization of network resources caused by uneven loads on different transmission links, a network load-aware duplicate task scheduling strategy called Shadow is proposed, which improves network resource utilization during data transmission. First, measurements of the execution time of different stages in a job show that ensuring the data locality of map tasks in the MapReduce framework leads to a long execution time in the shuffle stage. The analysis reveals that the load on different cross-rack links during shuffle is uneven, so that some links have idle network resources but no data left to transfer. Then, based on the principle of "power of choice", a method that uses replica tasks to balance the link load is proposed. To balance the network link load, Shadow iteratively selects an original map task from the most heavily loaded rack and creates duplicate tasks for it on the least loaded rack. Shadow chooses between an original task and its replica by efficiently pre-estimating the job execution time. Finally, extensive experiments are conducted to evaluate the Shadow design. Results show that Shadow reduces the cross-rack skewness by 36.6% and the job execution time by 26% compared to existing schemes.
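
The load skew that the abstract attributes to hash-based data partitioning can be illustrated with a minimal sketch. It is not the Astraea mechanism or any Spark API: the workload (one hot key plus many rare keys), the partition count, and the function names `hash_partition`, `partition_loads`, and `skew_ratio` are all assumptions made for illustration. The only point is that a hash partitioner sends every record with the same key to the same task, so a frequent key overloads one task while the others finish early.

```python
import zlib


def hash_partition(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner: equal keys always map to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions):
    """Count how many records each partition (task) receives."""
    loads = [0] * num_partitions
    for key in keys:
        loads[hash_partition(key, num_partitions)] += 1
    return loads


def skew_ratio(loads):
    """Maximum load divided by mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical skewed workload (an assumption, not data from the dissertation):
# one hot key with 6000 records plus 4000 keys that appear once each.
records = ["hot"] * 6000 + [f"key{i}" for i in range(4000)]
loads = partition_loads(records, num_partitions=8)

print("records per task:", loads)
print("max/mean load ratio:", round(skew_ratio(loads), 2))
```

In this sketch the task that receives the hot key holds at least 6000 of the 10000 records, so the stage finishes only when that task does, and the other seven tasks sit idle in the meantime.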
Keywords/Search Tags: Big data processing, Batch processing, Stream processing, Job scheduling, Data partitioning