
Benchmarking And Tuning Distributed Streaming Platforms

Posted on: 2018-10-09 | Degree: Master | Type: Thesis
Country: China | Candidate: S L Qian | Full Text: PDF
GTID: 2428330596490042 | Subject: Software engineering
Abstract/Summary:
Big data technology is rapidly changing how people work and live. As technology advances and information is exchanged ever more closely, the volume of data has exploded: web logs, trading orders, user data, news, social network content and so on. Big data technology emerged to process and analyze data at this scale. Because a single node cannot handle such workloads, distributed processing has become the cornerstone of big data technology. Stream computing is an important model within it, with long-established use cases in finance, network monitoring, sensors and log analysis. As stream volumes keep growing, stream processing faces new challenges, and the demand for processing large-scale streams continues to rise.

For a specific stream processing task, however, selecting a suitable distributed streaming platform is difficult, because the platforms are diverse, their configurations are complex, and reference results are lacking. How to select appropriate hardware resources for different platforms and applications is also an open question. This thesis focuses on these problems. Considering both popularity and novelty, we choose three streaming platforms as our targets: Apache Spark Streaming, Apache Storm and Apache Samza. In the benchmarking work, Spark Streaming and Storm are the primary targets, and we tune a number of their key parameters. As the benchmark tool, we modify and extend StreamBench, which consists of seven workloads covering two input formats (text and numeric data) and both stateful and stateless computation. Benchmark workload suites are built to address these problems, and the related metrics are defined. In the course of benchmarking and tuning, we evaluate the capability and fault tolerance of the three platforms and summarize the key knobs for performance tuning as well as hardware utilization.

Analysis of the results shows that the Spark Direct approach and Storm Trident can saturate the network and achieve higher throughput, the Spark Direct approach in particular because of its well-pipelined operations. The Spark Receiver approach also has higher throughput than Storm, but Storm has much lower latency. Spark remains fault tolerant and stable as the data scale grows and when nodes fail.
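The abstract compares platforms by throughput and latency but does not state how these metrics are defined. As a minimal sketch only, one common way to compute them from per-record timestamps is shown below in Python. The `Record` fields, the function names and the sample values are assumptions made for illustration; they are not the thesis's definitions and not part of StreamBench.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Record:
    # All fields are hypothetical; the thesis's actual metric inputs are not given.
    size_bytes: int      # payload size of the record
    ingest_time: float   # seconds: when the record entered the platform
    emit_time: float     # seconds: when its result was produced


def throughput_bytes_per_sec(records):
    """Total bytes processed divided by the wall-clock span of the run."""
    start = min(r.ingest_time for r in records)
    end = max(r.emit_time for r in records)
    if end <= start:
        raise ValueError("run must span a positive amount of time")
    return sum(r.size_bytes for r in records) / (end - start)


def mean_latency_sec(records):
    """Average time between a record's arrival and the emission of its result."""
    return mean(r.emit_time - r.ingest_time for r in records)


# Illustrative values only, not measurements from the thesis.
sample = [
    Record(100, 0.0, 0.5),
    Record(100, 1.0, 1.5),
    Record(200, 1.5, 2.0),
]

print(throughput_bytes_per_sec(sample))  # 400 bytes over 2.0 s
print(mean_latency_sec(sample))
```

Under these assumed definitions the two metrics are independent: a platform that batches records can move more bytes per second while each record waits longer, which is consistent with the reported contrast between Spark's throughput and Storm's latency.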
Keywords/Search Tags:distributed streaming computing, benchmark, big data, Spark, Storm