Research On Elastic Resource Scheduling Strategy For Big Data Stream Computing

Posted on: 2022-02-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Y Li
GTID: 1488306539498024
Subject: Computer application technology
Abstract/Summary:
With the development of new industries and service models such as the Internet of Things, e-commerce, smart cities, and intelligent transportation, the total amount of data generated globally is growing explosively and the real-time value of that data is becoming increasingly important, which makes big data stream computing a vital part of data analysis technology. Because stream data is real-time, volatile, bursty, irregular, and unbounded, analyzing it in real time poses tough challenges. Faced with bursty streams, existing processing platforms lack elasticity and scalability: the cluster cannot respond to drastic fluctuations in processing load, so processing latency rises and throughput falls. This is a bottleneck for the further development of big data stream computing. To address these problems, this dissertation takes Apache Flink as its main research object, carries out a series of studies on elastic resource scheduling for big data stream computing, and on that basis proposes the following elastic resource scheduling strategies.

(1) A dynamic data stream load balancing strategy based on load awareness is proposed to reduce the processing latency of the cluster. To address the problems that load is shared unevenly between nodes in big data stream processing platforms and that node performance is not evaluated comprehensively, a dynamic load balancing strategy based on a load awareness algorithm (LBLA) is proposed and applied to Apache Flink. It comprises a load awareness algorithm and a load balancing algorithm. First, the computational delay of each node is obtained by a depth-first search over the Directed Acyclic Graph (DAG) and used as the basis for evaluating node performance, from which the load balancing strategy is created. Second, load migration for data streams is implemented on top of the data block management strategy, and both global and local load optimization are achieved through feedback. Finally, the feasibility of the algorithm is demonstrated by evaluating its time and space complexity, and the influence of important parameters on its execution is discussed. The experimental results show that the algorithm improves task execution efficiency by optimizing load sharing between nodes, and that the data processing latency of the cluster is reduced by 6.51% on average.

(2) A dynamic task dispatching strategy for stream processing based on a flow network is proposed to make full use of existing computing resources and improve the throughput of the cluster. To address the problem that a sharp increase in the data input rate raises computing latency and so degrades the real-time behavior of big data stream processing platforms, a dynamic dispatching strategy based on a flow network (FNDD) is proposed and applied to the data stream processing platform Apache Flink. It comprises a capacity detection algorithm and a maximum flow algorithm. First, the Directed Acyclic Graph is transformed into a flow network by defining the capacity and flow of every edge, and the capacity detection algorithm is used to determine the capacity value of every edge. Second, the maximum flow algorithm is used to obtain the improved network and the optimization path, in order to raise cluster throughput while the data input rate is increasing; the feasibility of the algorithm is demonstrated by evaluating its time and space complexity. Finally, the influence of the important parameter on the algorithm's execution is discussed, and recommended parameter values for different types of jobs are obtained by experiment. The experimental results show that, during phases of increasing data input rate across different types of benchmarks, the throughput of the cluster with FNDD is 16.12% higher on average than with the original dispatching strategy of Apache Flink, so the dynamic dispatching strategy effectively improves cluster throughput while satisfying task latency constraints.

(3) A Flow-network based Auto Rescale strategy for Flink is proposed to improve the elasticity and scalability of the cluster in response to dynamic fluctuations in processing load. To address the problem that the load on big data stream computing platforms rises with fluctuations while the cluster cannot rescale efficiently, the Flow-network based Auto Rescale strategy for Flink (FAR-Flink) is proposed. It comprises a flow-network building algorithm, an elastic resource scheduling algorithm, and a state migration algorithm. First, the flow-network model is built and the capacity of each edge is calculated by a self-learning algorithm. Second, the bottleneck of the cluster is identified by a maximum-flow algorithm and a resource rescheduling plan is drawn up. Finally, the rescheduling plan is executed and the stateful data is migrated efficiently by a data migration algorithm based on partitioning data by bulk and bucket. The experimental results show that the strategy effectively improves performance in applications with complex stateful data: the throughput of FAR-Flink improves by 27.61% on average across different types of benchmarks. The strategy raises cluster throughput and reduces the time overhead of data migration while satisfying the application's latency constraints, which means that it efficiently improves the scalability of the cluster.

(4) A load prediction based auto rescale strategy for Flink is proposed to reduce the overhead and improve the timeliness of elastic resource scheduling. To address the problem that the load on big data stream computing platforms fluctuates drastically while the cluster suffers from performance bottlenecks caused by a shortage of computing resources, the Load Prediction based Auto Rescale Strategy in Flink (LPAR-Flink) is proposed. It comprises a load prediction algorithm, an online resource scheduling algorithm, and an online load migration algorithm. First, a load prediction model is built as the foundation of the load prediction algorithm, which predicts the trend of the processing load. Second, a resource judgment model is built to identify performance bottlenecks and resource redundancy in the cluster, and the resource scheduling algorithm draws up the resource rescheduling plan. Finally, the online load migration algorithm executes the rescheduling plan and migrates processing load among nodes efficiently. The experimental results show that the strategy delivers larger performance gains in applications with drastically fluctuating processing load: compared with FAR-Flink, the throughput of LPAR-Flink improves by 11.26% on average. The scale and resource configuration of the cluster respond to variations in processing load in time, and the communication overhead of load migration is reduced effectively.
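
The flow-network idea behind strategies (2) and (3) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract does not say which maximum-flow algorithm or capacity definition is used, so the sketch assumes Edmonds-Karp, and the operator names, the capacity values (read here as records per second), and the function name `max_flow_min_cut` are all hypothetical. It treats a job DAG as a flow network, computes the maximum flow, and reports the saturated edges of the minimum cut as the bottleneck that a rescaling plan would target.

```python
from collections import deque


def max_flow_min_cut(capacity, source, sink):
    """Edmonds-Karp maximum flow (an assumed choice of algorithm).

    capacity maps (u, v) edges to non-negative capacities.
    Returns (max flow value, sorted list of edges crossing the minimum cut).
    """
    # Residual capacities, with reverse edges initialised to 0.
    residual = {}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)

    def bfs_parents():
        # Breadth-first search over edges with spare residual capacity.
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    total = 0
    while True:
        parents = bfs_parents()
        if sink not in parents:
            break  # no augmenting path left: the flow is maximal
        # Smallest residual capacity along the augmenting path.
        path_cap = float("inf")
        v = sink
        while parents[v] is not None:
            u = parents[v]
            path_cap = min(path_cap, residual[u][v])
            v = u
        # Push that much flow along the path.
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= path_cap
            residual[v][u] += path_cap
            v = u
        total += path_cap

    # Nodes still reachable from the source define the minimum cut.
    reachable = set(parents)
    cut = sorted((u, v) for (u, v) in capacity
                 if u in reachable and v not in reachable)
    return total, cut


# Hypothetical job DAG with made-up edge capacities.
capacity = {
    ("source", "map"): 100,
    ("source", "filter"): 80,
    ("map", "window"): 60,
    ("filter", "window"): 50,
    ("window", "sink"): 90,
}
flow, bottleneck = max_flow_min_cut(capacity, "source", "sink")
print(flow, bottleneck)  # 90 [('window', 'sink')]
```

In this toy network the edge out of `window` limits throughput to 90. Raising that capacity, for example by adding parallel instances of the operator, would move the bottleneck to the two edges feeding `window`.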
Keywords/Search Tags: Big data, Stream computing, Flink, Resource scheduling, Load balancing, Resilient cluster