Benchmarking And Tuning Distributed Streaming Platforms

Posted on:2018-10-09

Degree:Master

Type:Thesis

Country:China

Candidate:S L Qian

GTID:2428330596490042

Subject:Software engineering

Abstract/Summary:

Big data technology is rapidly changing the way people work and live. As technology advances and information is exchanged ever more closely, the volume of data has exploded: web logs, trading orders, user data, news, social network content and so on. Big data technology emerged to process and analyze data at this scale. Because a single node cannot handle such workloads, distributed processing has become the cornerstone of big data technology. Stream computing is an important model within it, with long-established usage scenarios including finance, network monitoring, sensors and log analysis. As stream volumes keep increasing, stream processing faces growing challenges, and the demand for processing large-scale streams continues to rise.

However, for a specific stream processing task, it is difficult to select a suitable distributed streaming platform. The reasons are the diversity of streaming platforms, the complexity of their configurations, and the lack of reference results. How to select suitable hardware resources for different platforms and applications is likewise an open question. This thesis focuses on these problems. Considering their popularity and novelty, we choose three streaming platforms as our targets: Apache Spark Streaming, Apache Storm and Apache Samza. In the benchmark work, Spark Streaming and Storm are the primary targets, and we tune some of their key parameters. As the benchmark tool, we modify and extend StreamBench, which consists of seven workloads covering two types of input (text and numeric data) and both stateful and stateless computation. To address these problems, we build benchmark workload suites and define the related metrics. In the course of benchmarking and tuning, we evaluate the processing capability and fault tolerance of the three platforms, and we summarize the key knobs for performance tuning as well as the hardware utilization.

The results show that the Spark Direct approach and Storm Trident can saturate the network and achieve higher throughput; this is especially true of the Spark Direct approach, owing to its well-pipelined operations. The Spark Receiver approach also achieves higher throughput than Storm, but Storm has much lower latency. Spark is quite fault tolerant and remains stable as the data scale grows and when nodes fail.

Keywords/Search Tags:

distributed streaming computing, benchmark, big data, Spark, Storm

Related items

1	Study Of Distributed Brain Storm Optimization Algorithm And Its Application
2	Design And Implementation Of Data Stream Clustering Algorithm StreamCKS Based On Spark Streaming
3	Research On The Performance Modeling Of Spark Streaming
4	Research And Implementation Of Test Data Processing System Based On Spark Streaming
5	A System For Distributed MD Data Analysis Based On Spark
6	The Research And Implementation Of Multiple Sensor Data Fusion Technology
7	Design And Implementation Of Spark Platform For Big Data Streaming Computing Based On Kubernetes
8	Design And Implementation Of A Distributed And Real Time Video Stream Data Processing Platform Based On Spark
9	Research On Task Scheduling Based On Resource Aware In Storm Environment
10	Research On Elasticity Resource Management Strategy For Streaming Computing