Big Data System Optimization Under High-speed Network Environment

Posted on:2022-03-01

Degree:Master

Type:Thesis

Country:China

Candidate:Z H He

Full Text:PDF

GTID:2518306725993029

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid growth in the number of Internet users, the amount of data they generate has increased exponentially. Faced with massive amounts of data to store and process, and with the price of commodity servers continuing to fall, more and more companies rely on distributed storage and distributed computing systems. Researchers and engineers have accordingly open-sourced a variety of distributed systems, such as Hadoop and Spark, to meet the needs of different users. However, some of these open-source systems do not consider the characteristics of individual tasks when scheduling them, and even treat all tasks as identical, so resources are not used well to speed up job completion. Moreover, early distributed computing systems were built around the low-speed peripherals of their time, such as 1 Gbps network interface cards and HDDs, and some published work targets the optimization of those early systems. With the rapid development of high-speed storage and network devices in recent years, the performance bottleneck of some distributed systems has shifted, and these earlier optimizations no longer yield gains.

This thesis therefore combines the characteristics of the Hadoop MapReduce scheduler with the data skew phenomenon in big data applications and proposes a simple optimization of task scheduling. It then analyzes performance bottlenecks in well-known big data systems such as Hadoop MapReduce and Spark, explains the phenomenon of performance bottleneck shift, and gives a preliminary solution. The work consists of the following two parts.

The first part: data skew and optimization. We first illustrate the data skew phenomenon with several big data applications and analyze its causes. We then replace the original FIFO scheduling of Hadoop MapReduce with Largest-First scheduling and compare the two theoretically. Next, we describe the design of our implementation and the problems that had to be solved, and finally we demonstrate the effectiveness of Largest-First experimentally. Our benchmarks show that, for several classic big data applications, job completion time is reduced by 6.47%, 9.26%, and 13.66%, respectively.

The second part: performance bottlenecks of big data systems and co-scheduling. We first implement a Coflow scheduling algorithm in Hadoop MapReduce and demonstrate that this network scheduling algorithm yields no gain on high-speed network hardware. Then, by analyzing the source code of big data systems such as Hadoop MapReduce and Spark, and by experimentally examining several Hadoop MapReduce applications, we explain why network scheduling algorithms fail and how the performance bottleneck shifts. Finally, we propose a solution, a dynamic trade-off between computation and network, and verify it with preliminary experiments on Spymemcached/Memcached.
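
The contrast between FIFO and Largest-First ordering described in the first part can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the thesis's Hadoop implementation: task "size" is taken to be a known running time, tasks are placed greedily on whichever slot frees up first, and the workload and the function names (`makespan`, `fifo_makespan`, `largest_first_makespan`) are hypothetical.

```python
import heapq


def makespan(task_sizes, num_slots):
    """Greedy list scheduling: each task, in the given order,
    starts on the slot that becomes free earliest.
    Returns the time at which the last task finishes."""
    slots = [0.0] * num_slots          # finish time of each slot
    heapq.heapify(slots)
    for size in task_sizes:
        start = heapq.heappop(slots)   # earliest-free slot
        heapq.heappush(slots, start + size)
    return max(slots)


def fifo_makespan(task_sizes, num_slots):
    # FIFO: run tasks in arrival order (assumed to be list order).
    return makespan(task_sizes, num_slots)


def largest_first_makespan(task_sizes, num_slots):
    # Largest-First: run the biggest tasks first so a skewed,
    # oversized task does not start late and become a straggler.
    return makespan(sorted(task_sizes, reverse=True), num_slots)


# Hypothetical skewed workload: one large task arrives last.
tasks = [2, 2, 2, 2, 2, 2, 10]
print(fifo_makespan(tasks, 2))           # large task starts at t=6
print(largest_first_makespan(tasks, 2))  # large task starts at t=0
```

In this toy workload the oversized task starts last under FIFO and stretches the job, while Largest-First starts it immediately and fills the other slot with small tasks. In a real system task sizes are not known exactly and would have to be estimated, for example from input partition sizes, which is an assumption here.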

Keywords/Search Tags:

Big Data System, Data Skew, Task Scheduling, Performance Bottleneck

Related items

1	Optimization And Research On Reduce Task Scheduling Strategy And Data Skew On Hadoop
2	Spark Task Scheduling With Data Skew And Deadline Constraints
3	The Research Of Scheduling Algorithms For Performance And Energy Consumption Under The Condition Of Data Skew
4	Research On Resource-aware Skew Mitigation For Mapreduce
5	Research On Partition Loading Balance Based On Spark Data Skew
6	The Elastic Resource Allocation And Task Scheduling Of Spark
7	Research And Optimization Of Adaptive Techniques For Mitigating Skew In Spark
8	Research Of Performance Optimization For Data Skew Based On High-speed Networks
9	The Research On IP Network End-to-End Performance Bottleneck Based On Active Measurement
10	Research Of Task Partition And Resource Allocation Algorithms For Load Balance In Spark Computing Environment