
Distributed Streaming Connection Optimization Based On Spark Streaming

Posted on: 2019-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z M Chen
Full Text: PDF
GTID: 2428330593450541
Subject: Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Spark Streaming is a rising star among stream processing systems for big data environments. It uses a directed acyclic graph to determine the execution order of operations according to the dependencies between parent and child datasets in the current job. However, when it processes multi-stream connection (join) operations, it does not combine the basic information of each data stream with the connection relationships among the streams. Its single evaluation criterion can only prioritize the order in a simple way, which makes it hard to find a connection order with higher execution efficiency. In addition, continuous queries over windows that connect multiple data streams are executed repeatedly and independently. Each time, the result is recomputed from all the data in the current window, so adjacent windows involve a large amount of redundant computation and the query as a whole executes inefficiently.

To solve these problems, this paper proposes a heuristic search-based multiple data stream connection strategy and a timestamp-based intermediate result caching strategy. An appropriate connection order is derived from a connection tree built on the undirected weighted graph that corresponds to the set of data streams. A caching mechanism is then built on the fact that the nodes of the connection tree can conveniently store data, so that intermediate results are reused across adjacent windows and redundant computation is reduced. The main contributions of this paper are as follows.

1) Heuristic Search-Based Multiple Data Stream Connection Strategy: After analyzing the concepts and features of existing connection techniques and graph models in relational database systems and stream processing systems, the connection relationships between data streams are transformed into an undirected connected graph. Each vertex in the graph is weighted by the flow rate of its data stream, and each edge is weighted by the size of the intermediate result between the two data streams it links. A heuristic function is constructed by analyzing the cost of multi-stream connections, and a heuristic search-based query optimization strategy for multiple data stream connections is proposed to find the most appropriate connection order through the connection tree. Because data arrives continuously in stream processing, a weight criterion for the connection tree is proposed so that the weight is calculated while the tree is being constructed. The vertex and edge weights are updated periodically according to the basic characteristics of the data streams in the new time period, the connection tree is rebuilt according to the evaluation function, and the better solution is selected by comparing the weights of the old and new connection trees. This dynamic connection tree keeps connection operations continuously efficient.

2) Timestamp-Based Intermediate Result Caching Strategy: Building on the strategy above, and on the fact that the parent nodes of the connection tree can store calculation results, a timestamp-based intermediate result caching strategy is designed by combining sliding window technology with the storage features of the Resilient Distributed Datasets of the Spark platform. This further reduces the amount of computation during the execution of connection operations. According to the calculation rules of multiple data stream connection operations under this caching strategy, a timestamp-based cache recovery mechanism is proposed to make the calculation process more accurate and efficient.

3) Experimental Analysis: Based on the Kafka message queue and the Spark Streaming platform, multiple test data streams are generated through the Kafka producer API. The Spark Streaming platform receives and processes the data as the consumer, performs the data stream connection operations, and verifies the feasibility of the first two strategies. The experimental results show that the proposed multiple data stream connection strategy and intermediate result caching strategy effectively reduce the execution time of multiple data stream connection operations.
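As an illustration of contribution 1), the sketch below picks a connection order from an undirected weighted graph in plain Python. It is not the heuristic function of the thesis, which the abstract does not specify. It assumes a simple greedy stand-in: vertex weights are stream flow rates, edge weights are estimated intermediate result sizes, and the next stream to connect is always the one reachable through the cheapest edge. The function name `greedy_join_order` and the sample streams and weights are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def greedy_join_order(rates: Dict[str, float], edges: Dict[Edge, float]) -> List[str]:
    """Return a connection order for the streams in `rates`.

    rates: stream name -> flow rate (vertex weight).
    edges: (stream, stream) -> estimated intermediate result size (edge weight).
    Assumption: a greedy rule stands in for the thesis's heuristic function.
    """
    if len(rates) <= 1:
        return list(rates)
    if not edges:
        raise ValueError("join graph is not connected")

    # Seed the order with the edge that has the smallest intermediate result.
    (a, b), _ = min(edges.items(), key=lambda kv: (kv[1], kv[0]))
    # Within the seed pair, place the slower stream first (illustrative choice).
    order = sorted([a, b], key=lambda s: (rates[s], s))
    joined = set(order)

    while len(joined) < len(rates):
        candidates = []
        for (u, v), weight in edges.items():
            # Keep only edges that cross from the joined set to a new stream.
            if (u in joined) != (v in joined):
                new_stream = v if u in joined else u
                candidates.append((weight, rates[new_stream], new_stream))
        if not candidates:
            raise ValueError("join graph is not connected")
        # Cheapest crossing edge wins; ties go to the slower stream.
        _, _, next_stream = min(candidates)
        order.append(next_stream)
        joined.add(next_stream)
    return order


rates = {"A": 100.0, "B": 50.0, "C": 200.0, "D": 10.0}
edges = {("A", "B"): 500.0, ("B", "C"): 2000.0, ("A", "C"): 800.0, ("C", "D"): 300.0}
print(greedy_join_order(rates, edges))
```

Re-running the function on periodically refreshed `rates` and `edges` and keeping the cheaper of the old and new orders would mirror the dynamic rebuilding described in the abstract, although the comparison criterion used there is not given.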
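The caching idea in contribution 2) can be sketched in the same way. The example below is a minimal plain-Python model, not the RDD-based mechanism of the thesis. It assumes two streams, integer batch timestamps, and a window measured in batches. Partial results are cached per pair of batch timestamps, entries whose timestamps fall out of the window are evicted, and only the pairs that involve the newly arrived batch are computed. The class name `WindowJoinCache` and its fields are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str]          # (join key, payload)
Joined = Tuple[str, str, str]  # (join key, left payload, right payload)


class WindowJoinCache:
    """Sliding-window join of two streams that reuses per-batch-pair results."""

    def __init__(self, window_batches: int) -> None:
        self.window = window_batches
        self.left: Dict[int, List[Row]] = {}
        self.right: Dict[int, List[Row]] = {}
        # (left timestamp, right timestamp) -> cached partial join result
        self.cache: Dict[Tuple[int, int], List[Joined]] = {}
        self.joins_computed = 0  # number of batch-pair joins actually executed

    @staticmethod
    def _join(left: List[Row], right: List[Row]) -> List[Joined]:
        # Hash join: index the right batch by key, then probe with the left batch.
        index: Dict[str, List[str]] = {}
        for key, value in right:
            index.setdefault(key, []).append(value)
        return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]

    def advance(self, ts: int, left_batch: List[Row], right_batch: List[Row]) -> List[Joined]:
        """Add the batches stamped `ts` and return the join over the current window."""
        self.left[ts] = left_batch
        self.right[ts] = right_batch
        oldest = ts - self.window + 1

        # Evict batches and cached partial results older than the window.
        for store in (self.left, self.right):
            for t in [t for t in store if t < oldest]:
                del store[t]
        for key in [k for k in self.cache if min(k) < oldest]:
            del self.cache[key]

        # Compute only the batch pairs that are not cached yet.
        for lt in self.left:
            for rt in self.right:
                if (lt, rt) not in self.cache:
                    self.cache[(lt, rt)] = self._join(self.left[lt], self.right[rt])
                    self.joins_computed += 1

        result: List[Joined] = []
        for key in sorted(self.cache):
            result.extend(self.cache[key])
        return result


cache = WindowJoinCache(window_batches=2)
r0 = cache.advance(0, [("k", "a0")], [("k", "b0")])
r1 = cache.advance(1, [("k", "a1")], [("j", "b1")])
r2 = cache.advance(2, [], [("k", "b2")])
print(r0, r1, r2, cache.joins_computed)
```

In this toy run, seven batch-pair joins are executed across the three windows, whereas recomputing every window from scratch would take nine (1 + 4 + 4). The saving grows with the window length, because each slide adds only the pairs that touch the new batch.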
Keywords/Search Tags:heuristic search, data stream, undirected weighted graph, connection tree