
The Optimization of Network Performance of Spark Streaming Using PF_Ring Zero-Copy Technology

Posted on: 2018-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: F Liu
GTID: 2348330542983630
Subject: Computer application technology
Abstract/Summary:
Spark is a big-data computing framework that builds on the MapReduce model. It inherits the linear scalability and fault tolerance of MapReduce and extends them substantially. Its resilient distributed datasets (RDDs) can keep intermediate results of a computation in memory, which reduces the time spent reading and writing data on disk. Spark is therefore well suited to iterative algorithms and can serve both batch and real-time data analysis. Spark has also grown a mature ecosystem that includes Spark Streaming, DataFrame, Spark SQL, MLlib, GraphX, and SparkR.

PF_Ring is a type of network socket developed by Luca Deri. It makes kernel packet processing more efficient, balances load across applications, and greatly improves packet capture speed. PF_Ring reduces the number of memory copies and distributes received data to different circular buffers. With these methods, packet capture performance can also be increased through clustering.

The contribution of this thesis is the observation that the distributed computing framework Spark Streaming has a network performance bottleneck when it handles large numbers of small packets. By combining experimental data with an analysis of Spark's underlying data module, we determine the cause of the low network utilization. To address it, we take PF_Ring zero-copy technology, the data and system command hierarchy, multi-core binding for load balancing, and system unlocking operations as the main entry points. We propose an optimization based on a custom user-level protocol stack, and it achieves the desired effect.

The optimization of the Spark Streaming stream-computing framework is carried out in three steps. First, based on the relationships among the key Spark Streaming parameters related to network performance, we extract monitoring data that effectively reflect the state of network bandwidth. By comparing and analyzing these data, we determine that the low network utilization under particular conditions is caused by high kernel overhead, which arises when the streaming framework handles a large number of network I/O interrupts per unit time. Second, we optimize the scheduling between the network card and the CPU (cores and threads) to reduce context-switch and locality overhead. The monitoring data show that this improves the network performance of Spark Streaming, although the approach has limitations. Third, we propose using PF_Ring to move the protocol stack implementation out of the operating system kernel, which both reduces system management overhead and avoids additional data movement.

The results show that, compared with the current open-source Spark 1.4.0, optimizing Spark Streaming with PF_Ring zero-copy technology reduces latency, decreases overhead, and improves the utilization of the Worker nodes. The technique also meets the needs of real-time computation when receiving and processing large numbers of TCP packets.
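The two mechanisms the abstract attributes to PF_Ring, fewer memory copies and the distribution of received data across several circular buffers, can be illustrated with a small conceptual model. The sketch below is not the PF_Ring API (PF_Ring is a C library and kernel module) and is not the thesis's implementation. The names `PacketRing` and `ring_for_flow`, the CRC32 flow hash, and the slot sizes are all hypothetical choices made for illustration.

```python
from typing import Optional
from zlib import crc32


class PacketRing:
    """Hypothetical fixed-size circular buffer of packet slots.

    All slots live in one bytearray that is allocated once and reused,
    so no buffer is allocated per packet.
    """

    def __init__(self, slots: int, slot_size: int) -> None:
        self.slots = slots
        self.slot_size = slot_size
        self.storage = bytearray(slots * slot_size)
        self.view = memoryview(self.storage)
        self.lengths = [0] * slots
        self.head = 0   # next slot to write
        self.tail = 0   # next slot to read
        self.count = 0  # number of occupied slots

    def push(self, packet: bytes) -> bool:
        """Copy a packet into the next free slot; False if full or oversized."""
        if self.count == self.slots or len(packet) > self.slot_size:
            return False
        start = self.head * self.slot_size
        self.view[start:start + len(packet)] = packet
        self.lengths[self.head] = len(packet)
        self.head = (self.head + 1) % self.slots
        self.count += 1
        return True

    def pop(self) -> Optional[memoryview]:
        """Return a view of the oldest packet without copying its bytes.

        The view aliases the ring's storage, so it is only valid until
        the slot is overwritten by a later push.
        """
        if self.count == 0:
            return None
        start = self.tail * self.slot_size
        packet_view = self.view[start:start + self.lengths[self.tail]]
        self.tail = (self.tail + 1) % self.slots
        self.count -= 1
        return packet_view


def ring_for_flow(src: str, dst: str, sport: int, dport: int, n_rings: int) -> int:
    """Pick a ring index from a direction-independent hash of the flow.

    Sorting the two endpoints makes both directions of a connection map
    to the same ring (an assumption of this sketch, not a PF_Ring rule).
    """
    low, high = sorted([(src, sport), (dst, dport)])
    return crc32(f"{low}-{high}".encode()) % n_rings


if __name__ == "__main__":
    rings = [PacketRing(slots=4, slot_size=64) for _ in range(2)]
    idx = ring_for_flow("10.0.0.1", "10.0.0.2", 5000, 80, len(rings))
    rings[idx].push(b"small TCP payload")
    payload = rings[idx].pop()
    print(idx, bytes(payload))
```

In this model the consumer reads packet bytes through a `memoryview` rather than receiving a fresh copy, and each ring could be served by its own worker pinned to a core. That mirrors, in a much simplified form, the copy reduction and per-buffer load distribution described above.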
Keywords/Search Tags:zero copy, PF_Ring, Spark, Spark Streaming, overhead