
Research On Data Partitioning And Placement Strategy Of Spark Streaming

Posted on: 2018-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: C Chen
Full Text: PDF
GTID: 2348330563952446
Subject: Computer Science and Technology

Abstract/Summary:
Spark Streaming, a state-of-the-art system for batched stream processing of big data, is an extension of the Spark engine. It treats stream computation as a series of short Map/Reduce-style batch jobs, which yields high data throughput and near-real-time processing. Data partitioning and data placement are two essential stages of Spark Streaming: data partitioning divides the incoming stream into data blocks along the time axis, and data placement selects the computing node on which each block is stored. Currently, Spark Streaming uses a static data partitioning strategy and a random data placement strategy. The static partitioning strategy cannot adapt to fluctuations in the stream load and therefore leaves computing capacity underutilized; the random placement strategy can distribute the tasks of a Spark job unevenly across computing nodes and thus degrade the efficiency of parallel data processing.

To address these problems, this thesis proposes a dynamic data partitioning strategy and a dynamic weighted data placement strategy for the Spark Streaming platform. The partitioning strategy uses an approximate one-dimensional search to dynamically find the block interval that gives the best batch processing performance as the stream load fluctuates. The placement strategy assigns each node a data placement weight based on its actual computing power, so that the data distribution matches the computing power available to jobs on each node.

The main contributions of this thesis are summarized as follows:

1) A dynamic data partitioning strategy for Spark Streaming, called DDPS. A dynamic data partitioning model is built around an approximate one-dimensional search: the model analyzes batch processing before and after a change of the block interval, and the interval is corrected by feedback adjustment until it converges to the optimized value, achieving the best execution performance for batched stream processing.

2) A dynamic weighted data placement strategy for Spark Streaming, called DWDPS. Using historical task execution information at node granularity, an evaluation model of each node's computing ability is constructed, and a data placement weight is assigned to each node according to its relative computing ability. The target node for each data block is then selected from the placement weights and the number of blocks already placed on each node, so that the amount of data placed on a node matches its computing ability and stream data is processed efficiently.

3) Prototype implementation, performance evaluation, and analysis. The dynamic data partitioning strategy and the dynamic weighted data placement strategy are implemented on top of open-source Spark Streaming and evaluated with streaming workloads. The results show that the dynamic data partitioning strategy reduces the average response time by 11% compared with the static partitioning strategy, and the dynamic weighted data placement strategy reduces the average batch processing time by 20% compared with the random placement strategy.
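To illustrate contribution 1), the following is a minimal Scala sketch of an approximate one-dimensional search over the block interval in the spirit of DDPS. The function `measureBatchTime`, the search bounds, and the stopping tolerance are illustrative assumptions, not the thesis' exact model; the feedback signal stands in for the observed batch processing time at a candidate block interval.

```scala
// Hypothetical sketch of the approximate one-dimensional search behind DDPS.
// `measureBatchTime`, the bounds and the tolerance are illustrative assumptions.
object BlockIntervalSearch {
  def search(measureBatchTime: Double => Double,
             lowMs: Double = 50.0,
             highMs: Double = 2000.0,
             toleranceMs: Double = 10.0): Double = {
    var lo = lowMs
    var hi = highMs
    while (hi - lo > toleranceMs) {
      // Probe two interior block intervals and keep the half that performs better.
      val m1 = lo + (hi - lo) / 3
      val m2 = hi - (hi - lo) / 3
      if (measureBatchTime(m1) < measureBatchTime(m2)) hi = m2 else lo = m1
    }
    (lo + hi) / 2 // approximately optimal block interval (ms)
  }
}
```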
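For contribution 2), the sketch below shows one possible reading of the DWDPS node-selection idea. `NodeStats`, the use of average task execution time as the computing-ability signal, and the selection rule are illustrative assumptions rather than the thesis' actual implementation.

```scala
// Hypothetical sketch of weighted data placement in the spirit of DWDPS.
case class NodeStats(nodeId: String, avgTaskMillis: Double, placedBlocks: Int)

object WeightedPlacement {
  // Relative computing ability: faster historical task execution => larger weight.
  def placementWeights(nodes: Seq[NodeStats]): Map[String, Double] = {
    val speeds = nodes.map(n => n.nodeId -> (1.0 / n.avgTaskMillis)).toMap
    val total  = speeds.values.sum
    speeds.map { case (id, s) => id -> s / total } // normalised to sum to 1
  }

  // Pick the node whose share of already-placed blocks lags its weight the most,
  // so the block distribution tracks each node's computing ability.
  def selectTarget(nodes: Seq[NodeStats]): String = {
    val weights     = placementWeights(nodes)
    val totalBlocks = math.max(1, nodes.map(_.placedBlocks).sum)
    nodes.maxBy(n => weights(n.nodeId) - n.placedBlocks.toDouble / totalBlocks).nodeId
  }
}
```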
Keywords/Search Tags: batched stream processing, Spark Streaming, data partitioning strategy, data placement strategy