
Study On Benchmarking Stream Computing Frameworks At Scale

Posted on: 2016-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: R R Lu
Full Text: PDF
GTID: 2308330476453475
Subject: Software engineering
Abstract/Summary:
As big data becomes ubiquitous, interest in processing data streams at scale is also growing, which has led to the emergence of many distributed stream computing systems. However, the complexity of stream computing and the diversity of workloads make benchmarking these systems challenging, and the lack of standard criteria makes evaluating and comparing them difficult.

This paper takes an early step towards benchmarking modern distributed stream computing frameworks. After identifying the challenges and requirements in the field, we propose a benchmark, StreamBench, designed to meet those requirements. StreamBench employs a message system as a mediator between stream data generation and consumption. It includes seven benchmark programs that cover typical stream computing scenarios and core operations. Beyond measuring performance under different data scales, it also accounts for fault tolerance and durability, and accordingly incorporates four workload suites targeting these different aspects of a system.

Finally, we demonstrate the feasibility of StreamBench by applying it to two popular frameworks, Apache Storm and Apache Spark Streaming, and compare the two platforms from multiple perspectives using the StreamBench workload suites. Under our experimental settings and configurations, Spark's throughput is about 5 times that of Storm, although Storm's throughput catches up as the average record size grows. Storm's latency is in most cases far lower than Spark's, which is on the order of seconds, but exceeds Spark's as workload complexity and data scale grow. A single node failure does not noticeably affect Spark, but it reduces Storm's throughput by about one third and increases Storm's latency four to five fold. Both frameworks successfully completed a two-day durability test. In addition, we use the benchmark to demonstrate the performance improvement in Storm's latest version: our experiments show that Storm 0.9.3 achieves an average throughput increase of 26% and an average latency reduction of 40%. We also evaluate the performance penalty of a new Spark Streaming feature, reliable Kafka consumption, and observe a 40%-70% throughput drop when the feature is enabled.
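The abstract describes a message system acting as the mediator between stream data generation and consumption; the Kafka-consumption experiments suggest that system is Apache Kafka. As a minimal sketch of that design, assuming a Kafka topic named "streambench" and broker address "kafka-host:9092" (both hypothetical, not taken from the thesis), a data generator might publish records like this:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object DataFeed {
  def main(args: Array[String]): Unit = {
    // Kafka mediates between data generation and the system under test.
    // Broker address and topic name below are hypothetical placeholders.
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-host:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    for (i <- 1 to 1000) {
      // Each record stands in for one generated stream event.
      producer.send(new ProducerRecord[String, String]("streambench", s"record-$i"))
    }
    producer.close()
  }
}
```

Decoupling generation from consumption this way lets the same pre-generated stream be replayed against both frameworks under identical conditions, which is what makes the cross-platform comparison fair.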
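The reliable Kafka consumption feature measured above corresponds, in Spark Streaming 1.x, to receiver-based input with the write-ahead log enabled, so that received records are persisted to fault-tolerant storage before being acknowledged. Below is a minimal sketch of enabling that path; the hostnames, topic, group id, and checkpoint path are placeholders, not values from the thesis:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReliableConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("StreamBenchReliableKafka")
      // The write-ahead log is what makes receiver-based Kafka input
      // "reliable" in Spark Streaming 1.x, at the cost of an extra write.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("hdfs:///streambench/checkpoint") // WAL needs a checkpoint dir

    val stream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",          // ZooKeeper quorum (hypothetical)
      "streambench-group",     // consumer group id (hypothetical)
      Map("streambench" -> 1), // topic -> receiver thread count
      StorageLevel.MEMORY_AND_DISK_SER // in-memory replication is redundant with WAL
    )

    stream.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Since every received batch must hit fault-tolerant storage before processing, the extra synchronous write is a plausible source of the 40%-70% throughput drop the abstract reports.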
Keywords/Search Tags: stream computing, distributed computing, benchmark, big data