Research On Task Scheduling Optimization Methods For Big Data Stream Computing Framework

Posted on: 2019-05-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Lu
Full Text: PDF
GTID: 1368330566966585
Subject: Computer application technology
Abstract/Summary:
With the development of technologies and industries such as the Internet of Things, e-commerce, intelligent transportation systems, and virtual reality, global data volume has begun to grow explosively, and real-time streaming data has become a key component of big data. Because of its five characteristics (real-time arrival, volatility, burstiness, irregularity, and infinity), streaming big data cannot be analyzed by traditional big data batch computing frameworks. Big data stream computing frameworks emerged in response and quickly became the preferred solution for processing streaming big data. However, the default round-robin task scheduling mechanism does not take into account the differences in performance and load among work nodes, the differing communication overheads, scalability, or other critical factors, and therefore cannot fully exploit the performance of the stream computing cluster. To solve these problems, this dissertation takes the most popular stream computing framework, Apache Storm, as its research object and proposes the following task scheduling optimization methods.

(1) An associated-tasks-aware task scheduling strategy for stream computing frameworks is proposed. Storm's default scheduler uses a round-robin strategy that considers neither the differences in work node configurations nor the different communication modes between tasks. To solve this problem, an Associated Tasks-Aware Task Scheduling Strategy in Storm (ATA-Storm) is proposed, built on a basic Storm model, a task-quantity constraint model, and an optimal data stream communication model. The strategy first obtains the components in the hierarchical order of the topology and initially deploys them to the work nodes where the data source resides, following the task localization principle. Then, taking the remaining capacity of each work node into account, it deploys as many tasks of each Bolt component as possible to the nodes where their upstream tasks are already located, minimizing communication overhead while keeping task assignment fair. In the experiments, a non-linear topology with two different kinds of data sources is run on a heterogeneous Storm cluster for comparison, and the results show that the proposed strategy outperforms the offline scheduler in scheduling results, communication overhead, and latency.

(2) A weight-based task scheduling strategy for stream computing frameworks is proposed. Static task scheduling strategies do not acquire the real-time load of tasks or the size of data streams, so they are not suitable for complex stream groupings and changing application scenarios. To solve this problem, a Task Scheduling Algorithm based on Weight in Storm (TSAW-Storm) is proposed, built on a weighted topology model, a load balancing model, and an optimal communication overhead model. The algorithm borrows the idea of graph partitioning. First, it takes the CPU resource occupation of a task as that task's weight within a specific topology, and the real-time tuple rate between a pair of tasks as the weight of the data stream between them. Tasks are then assigned step by step to the most suitable work node by maximizing the gain of data streams, that is, by turning as many inter-node data streams as possible into intra-node ones, while preserving load balance and the task localization principle, in order to reduce network transmission overhead. Experimental results show that, on the WordCount benchmark in a homogeneous cluster environment, the proposed algorithm improves latency, communication overhead, and load balance compared with the Storm default scheduler and the online scheduler, while significantly reducing execution overhead.

(3) A task migration strategy for stream computing frameworks is proposed. Most dynamic scheduling strategies for existing stream computing frameworks require tasks to be redeployed while the topology is running, which inevitably pauses the normal operation of the topology and incurs a large execution overhead. To solve this problem, a Task Migration Strategy for Heterogeneous Storm Cluster (TMSH-Storm) is proposed, based on establishing and proving a resource-constrained model and a task migration model. First, the source node selection algorithm adds work nodes that exceed the threshold to a set of source nodes, according to the workload and priority of CPU, memory, and network bandwidth on each work node. The task migration algorithm then considers factors such as migration overhead, communication overhead, resource constraints, and the load of each node and each task, and migrates tasks from the source nodes to suitable destination nodes successively and asynchronously. The dissertation also analyzes, at the theoretical level, the gap between the local task migration strategy and the global task redeployment strategy in terms of execution process and execution result, and proves that the task migration strategy can achieve a greater performance improvement with a lower execution overhead. Finally, four benchmarks are run on a heterogeneous Storm cluster. Experimental results show that the task migration strategy effectively reduces latency and inter-node communication overhead, and that its execution overhead is lower than that of existing research, yielding a smooth and lightweight method of task scheduling.

(4) A resilient cluster construction method for stream computing based on the task migration strategy is proposed. When cluster resources are insufficient or in surplus, it is particularly important to scale the cluster dynamically and effectively. Existing approaches disturb the running topologies to some extent while work nodes are being added or removed. To solve this problem, the idea of task migration is introduced, an improved resource constraint model and an expanded task migration model are built, and a resilient cluster construction method for stream computing based on the task migration strategy is proposed. Using the domino effect and the combined effect of task migration, the Dynamic Cluster Growing Algorithm Based on Task Migration Strategy (DCGA) automatically increases the number of work nodes when cluster resources are insufficient and then migrates appropriate tasks from the overloaded work nodes to the newly added ones. When a node in the cluster has excess resources, the Dynamic Cluster Shrinking Algorithm Based on Task Migration Strategy (DCSA) automatically migrates the tasks on it to other work nodes so that the original node can be shut down. Experimental results show that, on the WordCount benchmark deployed on a heterogeneous cluster, the proposed method achieves smooth scaling of the stream computing cluster, reducing latency and improving the reliability of tuple processing.
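
The weight-based idea in (2) can be illustrated with a small sketch. This is not the TSAW-Storm algorithm itself, whose details the abstract does not give; it is a simplified greedy placement written under stated assumptions. Tasks carry a CPU weight, streams carry a tuple-rate weight, each node has a hypothetical CPU capacity that stands in for the load balancing model, and tasks are visited in descending order of total incident stream rate (an assumed ordering). All function names, task names, and numbers below are invented for illustration.

```python
from collections import defaultdict


def greedy_assign(task_cpu, stream_rate, node_capacity):
    """Place each task on the node that keeps the most stream weight intra-node.

    task_cpu:      {task: CPU weight}
    stream_rate:   {(task_a, task_b): tuple rate}, treated as undirected
    node_capacity: {node: maximum total CPU weight}, an assumed stand-in
                   for a load-balance constraint
    """
    # Symmetric adjacency map: neighbors[t][peer] = stream weight between them.
    neighbors = defaultdict(dict)
    for (a, b), rate in stream_rate.items():
        neighbors[a][b] = neighbors[a].get(b, 0) + rate
        neighbors[b][a] = neighbors[b].get(a, 0) + rate

    load = {node: 0.0 for node in node_capacity}
    placement = {}

    # Assumed ordering: tasks with the heaviest total traffic are placed first.
    order = sorted(task_cpu, key=lambda t: (-sum(neighbors[t].values()), t))
    for task in order:
        best_node, best_key = None, None
        for node in sorted(node_capacity):
            if load[node] + task_cpu[task] > node_capacity[node]:
                continue  # node would exceed its capacity
            # Gain: stream weight to peers already placed on this node.
            gain = sum(rate for peer, rate in neighbors[task].items()
                       if placement.get(peer) == node)
            key = (gain, -load[node])  # ties go to the less loaded node
            if best_key is None or key > best_key:
                best_node, best_key = node, key
        if best_node is None:
            raise ValueError(f"no node has capacity for task {task!r}")
        placement[task] = best_node
        load[best_node] += task_cpu[task]
    return placement


def round_robin(tasks, nodes):
    """Baseline: assign tasks to nodes in turn, ignoring all weights."""
    return {task: nodes[i % len(nodes)] for i, task in enumerate(tasks)}


def inter_node_traffic(placement, stream_rate):
    """Total stream weight that crosses node boundaries."""
    return sum(rate for (a, b), rate in stream_rate.items()
               if placement[a] != placement[b])


# Toy WordCount-like topology; every value here is made up.
task_cpu = {"spout": 20, "split": 30, "count_a": 25, "count_b": 25}
stream_rate = {("spout", "split"): 1000,
               ("split", "count_a"): 600,
               ("split", "count_b"): 400}
node_capacity = {"node1": 80, "node2": 80}

weighted = greedy_assign(task_cpu, stream_rate, node_capacity)
baseline = round_robin(list(task_cpu), sorted(node_capacity))
print(inter_node_traffic(weighted, stream_rate),
      inter_node_traffic(baseline, stream_rate))
```

On this toy input the greedy placement leaves only the 400-weight stream crossing nodes, while the round-robin baseline cuts streams totalling 1600. The capacity cap is the only thing that stops the greedy rule from packing every task onto one node, which is why a load-balance constraint has to accompany the communication gain.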
Keywords/Search Tags:Big Data, Stream Computing, Storm, Task Scheduling, Task Migration, Resilient Cluster