
Research on the Filtering Problem of the θ-Join Between Multi-Way Data Streams

Posted on: 2021-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Hu
Full Text: PDF
GTID: 2428330611457231
Subject: Computer technology
Abstract/Summary:
A data stream is a large, fast, sequential, and continuous collection of data. In recent years, data stream processing applications such as e-commerce, network monitoring, and advertising systems have attracted increasing attention. As one of the basic operations, the join plays a very important role in stream processing. θ is the join condition, including <, ≤, ≥, >; if θ is "=", the join becomes an equi-join. The purpose of a join is to find the specific objects in different data sets that satisfy the join condition. Targeting multi-stream joins in the analysis and processing of massive streaming data, this thesis studies equi-joins and non-equi-joins separately and proposes an efficient solution for each.

For non-equi-joins, we propose FastThetaJoin, an optimization technique for the theta-join operation on multi-way data streams. Although theta-join is an essential query used in many data analytics tasks, it is difficult to implement on multi-way data streams in practice: moving data between multiple operation components incurs tremendous communication and computing overhead, which makes theta-join especially tricky to implement in a distributed environment. Like previous methods, FastThetaJoin tries to minimize the number of theta-joins, but it differs from them in how it chooses the partition strategy, deletes unnecessary data, and performs the Cartesian product. As a result, FastThetaJoin not only reduces the number of theta-joins effectively but also makes them more efficient in a distributed environment. We implemented FastThetaJoin in the Spark Streaming framework; the implementation is characterized by its efficient bucket implementation of the parameterized windows. The experimental results show that, compared with existing solutions, the proposed method reduces theta-join overhead and speeds up theta-join processing, improving speed by more than 30% over existing algorithms. The size of the gain depends on the nature of the data streams: the greater the data difference, the more pronounced the optimization effect.

For equi-joins, we present the Multiple Cuckoo Filter (MCF), a new stream-oriented filter. MCF is based on the classic cuckoo filter and can be used to determine whether a given element is present in all data streams within a specific time period. The method decomposes a membership query over multiple data sets into single-set operations: it works on data streams, and each data stream has its own independent filter. Experimental results show that the time for insert and query operations increases as the number of data streams increases. The query time of MCF also increases gradually as the sliding window shrinks and the number of windows grows.
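The idea of keeping one independent filter per stream and answering an "is this element in all streams?" query as a series of single-filter lookups can be sketched as follows. This is a minimal illustration, not the thesis's MCF implementation: the class names `CuckooFilter` and `PerStreamFilters`, the 16-bit fingerprint, the bucket parameters, and the use of explicit deletion to model window expiry are all assumptions made for the example.

```python
import hashlib
import random


def _h(data: bytes) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class CuckooFilter:
    """Textbook cuckoo filter: approximate membership with deletion support."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR-based alternate index in range.
        assert num_buckets > 0 and num_buckets & (num_buckets - 1) == 0
        self.mask = num_buckets - 1
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _fingerprint(self, item: str) -> int:
        # 16-bit fingerprint, forced to be non-zero.
        return (_h(b"fp:" + item.encode()) & 0xFFFF) or 1

    def _index(self, item: str) -> int:
        return _h(item.encode()) & self.mask

    def _alt(self, index: int, fp: int) -> int:
        # Partial-key cuckoo hashing: applying _alt twice returns the original index.
        return (index ^ _h(fp.to_bytes(2, "big"))) & self.mask

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter treated as full; the last evicted fingerprint is dropped

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]

    def delete(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False


class PerStreamFilters:
    """Hypothetical wrapper: one independent cuckoo filter per data stream."""

    def __init__(self, stream_ids, **filter_kwargs):
        self.filters = {s: CuckooFilter(**filter_kwargs) for s in stream_ids}

    def insert(self, stream_id, item: str) -> bool:
        return self.filters[stream_id].insert(item)

    def expire(self, stream_id, item: str) -> bool:
        # Assumption: an element leaving the window is removed by an explicit delete.
        return self.filters[stream_id].delete(item)

    def in_all_streams(self, item: str) -> bool:
        # The multi-set query is decomposed into one lookup per stream filter.
        return all(f.contains(item) for f in self.filters.values())


streams = PerStreamFilters(["orders", "clicks", "ads"])
for s in ("orders", "clicks", "ads"):
    streams.insert(s, "user42")
streams.insert("orders", "user7")
print(streams.in_all_streams("user42"))  # True: the key was inserted into every filter
print(streams.in_all_streams("user7"))   # expected False, barring a fingerprint collision
```

Because each query touches every per-stream filter, the lookup cost in this sketch grows linearly with the number of streams, which is consistent with the trend reported in the abstract. Deletion is only safe for items that were actually inserted, as in any cuckoo filter.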
Keywords/Search Tags: Theta-Join, Multi-Way data streams, Data streams, Cuckoo filter, Filter