
The Distinct Element Problem In Equi-join For Multiple Data Streams

Posted on: 2021-04-20  Degree: Master  Type: Thesis
Country: China  Candidate: C Xiong  Full Text: PDF
GTID: 2428330623465037  Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, sensors, and related technologies, data is generated at an ever-increasing rate. Valuable information is hidden in this data, and mining and using it brings growing convenience to people's daily lives. In many application scenarios, information is delivered to users through data streams. Such information is usually highly time-sensitive, is rarely stored in traditional databases, and should be processed on the fly as it is generated. In addition, because application scenarios differ, each data stream usually provides only part of the information, so complete information must be obtained by combining data streams from various sources. In data stream processing, stream joins integrate information across multiple data streams to produce complete information. Spark Streaming is a platform for processing stream joins in big data environments. It determines the execution order of joins according to the dependency relationships between the parent and child data sets of the current operations. However, because this is its only evaluation criterion, a join over multiple data streams can only be divided into a simple sequence, and the join cannot be optimized according to the overall correlation among the data streams, which results in low join execution efficiency. Building on in-depth research and a summary of related work, this thesis studies the distinct element counting problem for multiple data streams and uses it to optimize the equi-join of multiple data streams. The details are as follows.

First, this thesis studies the global correlation among multiple data streams, i.e., the problem of counting the distinct elements of data streams. Following a comparative study of several existing distinct element counting algorithms, it proposes an optimization scheme for the equi-join of multiple data streams based on the Hamming norm and a join tree. The distinct element counting method based on the Hamming norm reflects the approximate correlation among the data streams in the current sliding window and pre-processes the data for the subsequent equi-join optimization. This part is the foundation for that optimization.

Second, based on the characteristics of the equi-join of multiple data streams, the join relationship between each pair of data streams is transformed into an undirected graph model. Each edge in the graph is assigned a weight according to the intersection of the distinct elements of the two data streams it connects, and an appropriate join sequence is then derived from the edge weights. To cope with the continuous arrival of data, the join tree is periodically updated by comparing the edge weights in the undirected graph, which yields a dynamic and efficient join tree suited to fast, continuous data stream processing.

Finally, this thesis generates multiple test data sets through a Kafka message queue, receives and processes them on the Spark platform, and performs join operations over multiple data streams. The experimental results show that the strategy based on distinct elements and a join tree reduces the size of the intermediate results of the multi-stream join by about 25% and improves join efficiency by about 16%.
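The graph step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. It makes several assumptions the abstract does not confirm: exact set intersection stands in for the approximate Hamming-norm estimate, a Kruskal-style spanning tree stands in for the join-tree construction, and the `prefer_small` flag is hypothetical, since the abstract does not say whether pairs with smaller or larger overlap are joined first. All function names are illustrative.

```python
from itertools import combinations


def distinct_overlap(window_a, window_b):
    """Count join-key values that occur in both windows.

    Assumption: an exact set intersection replaces the approximate,
    Hamming-norm-based estimate mentioned in the abstract.
    """
    return len(set(window_a) & set(window_b))


def build_overlap_graph(windows):
    """Build the undirected graph as {(u, v): weight}.

    `windows` maps a stream name to the join keys seen in its
    current sliding window.
    """
    return {
        (u, v): distinct_overlap(windows[u], windows[v])
        for u, v in combinations(sorted(windows), 2)
    }


def join_tree(edges, prefer_small=True):
    """Pick a spanning tree of the overlap graph (Kruskal-style).

    Assumption: edges are taken in ascending weight order when
    `prefer_small` is True, descending otherwise. The abstract does
    not state which order the thesis uses.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ordered = sorted(edges.items(), key=lambda kv: kv[1],
                     reverse=not prefer_small)
    tree = []
    for (u, v), weight in ordered:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge links two unjoined groups
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


if __name__ == "__main__":
    windows = {"s1": [1, 2, 3, 4], "s2": [3, 4, 5], "s3": [4, 9]}
    graph = build_overlap_graph(windows)
    print(join_tree(graph))
```

Re-running `build_overlap_graph` and `join_tree` on each new sliding window would correspond to the periodic update of the join tree, again only as a sketch under the assumptions above.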
Keywords/Search Tags:real-time data stream, multiple data stream join, distinct element count, join tree, undirected graph