A Study of Load Balancing Strategies for Distributed Stream Join Systems

Posted on: 2020-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: S J Zhou
Full Text: PDF
GTID: 2428330590458359
Subject: Computer system architecture
Abstract/Summary:
In the big data era, many applications, such as stock trading and online advertisement analysis, must perform fast and accurate join operations on large-scale real-time data streams. Traditional join systems struggle to meet this demand because the streams to be processed are rapid and unbounded. Stream join systems have been proposed to satisfy the needs of these real-time applications. To achieve high throughput and low latency, distributed stream join systems use stream partitioning strategies to execute the complex stream join procedure in parallel. Existing systems mainly deploy two kinds of partitioning strategies: random partitioning and hash partitioning. Random partitioning distributes the tuples of one data stream uniformly while broadcasting all the tuples of the other data stream. This simple strategy may incur a large amount of unnecessary computation for low-selectivity stream joins. Hash partitioning maps the tuples of both data streams according to their join attributes. However, hash partitioning suffers from a serious load imbalance problem caused by the skewed distribution of attribute values, which is common in real-world data, and the skewed load may seriously degrade system performance. This thesis investigates the tuples that cause heavy load skew. To identify these tuples, an efficient key selection algorithm, GreedyFit, is proposed. Based on GreedyFit, a lightweight tuple migration strategy is designed to resolve the load imbalance problem in real time. A new distributed stream join system, FastJoin, is presented to run real-time applications that execute join operations. Experimental results on real-world data show that FastJoin significantly improves throughput and latency compared with state-of-the-art stream join systems.
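The contrast between the two partitioning strategies can be illustrated with a minimal Python sketch. It is not taken from the thesis and does not reproduce GreedyFit or FastJoin. The function names, the CRC32 hash, the round-robin stand-in for uniform random assignment, and the `imbalance` metric (heaviest worker load divided by mean load) are all assumptions made for illustration.

```python
import zlib
from collections import Counter
from itertools import cycle


def hash_partition(keys, n_workers):
    """Route each tuple to the worker selected by hashing its join key."""
    load = Counter({w: 0 for w in range(n_workers)})
    for key in keys:
        # CRC32 is an arbitrary choice; it keeps the routing deterministic.
        load[zlib.crc32(key.encode()) % n_workers] += 1
    return load


def random_partition(stored_keys, probe_keys, n_workers):
    """Spread one stream evenly and broadcast the other to every worker.

    Round-robin stands in for uniform random assignment (an assumption).
    """
    load = Counter({w: 0 for w in range(n_workers)})
    workers = cycle(range(n_workers))
    for _ in stored_keys:
        load[next(workers)] += 1
    for _ in probe_keys:
        # Broadcasting costs one tuple per worker, whatever the key is.
        for w in range(n_workers):
            load[w] += 1
    return load


def imbalance(load):
    """Heaviest worker load divided by the mean load (1.0 means balanced)."""
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean


if __name__ == "__main__":
    # A skewed stream: one hot key accounts for 90% of the tuples.
    skewed = ["hot"] * 900 + [f"k{i}" for i in range(100)]
    print("hash imbalance:  ", imbalance(hash_partition(skewed, 4)))
    print("random imbalance:", imbalance(random_partition(skewed, ["hot"] * 10, 4)))
```

In this toy setting, hash partitioning sends every tuple with the hot key to a single worker, so that worker's load is several times the mean. Random partitioning stays balanced but pays for it by replicating every broadcast tuple on all workers.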
Keywords/Search Tags: distributed stream join system, data skew, dynamic load balancing, load migration