
Intermediate Data Partitioning Strategy And Transmission Optimization For MapReduce

Posted on: 2020-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: J B Jiang
GTID: 2428330599476476
Subject: Software engineering
Abstract/Summary:
With the rapid development of information technology, data volumes have grown explosively, and the era of big data has arrived. Big data contains great value and is the "diamond mine" of the 21st century. The MapReduce parallel computing framework is the mainstream big data processing technology. However, intermediate data partitioning and transmission in MapReduce are the main bottlenecks affecting overall performance. Because the distribution of intermediate data is difficult to obtain in advance, the default partitioning strategy tends to distribute data unevenly in the Reducer phase, which leads to an unbalanced load across Reducer tasks. In addition, because a Reducer must wait for the Mappers to finish before it can fetch the intermediate data, data transmission is delayed. This dissertation studies intermediate data partitioning and transmission in the MapReduce parallel computing framework, with the aim of achieving balanced data partitioning and reducing transmission delay. The main research contents include:

(1) To address the skew of intermediate data, an iterative balanced data partitioning strategy for MapReduce is proposed. The data blocks to be processed in the Mapper phase are subdivided and processed iteratively. The micro-partition allocation scheme for the current round is determined from the allocation results of the earlier rounds, so that the skew produced by previous iterations is continuously corrected and balanced partitioning is achieved gradually. In addition, an iterative balanced data partitioning mechanism is given, including the partitioning timing, the partitioning criteria, the partitioning evaluation, and a partitioning algorithm based on a greedy strategy.

(2) To address the transmission delay of intermediate data, a pipelined data transmission optimization method for MapReduce is proposed. The effective computation of the Mapper phase, the intermediate data transmission, and the effective computation of the Reducer phase are each divided into several stages and executed in a pipelined manner, which hides the delay overhead caused by data transmission and improves the data processing performance of the MapReduce framework. In addition, a data transmission pipeline optimization mechanism is given, including the transmission timing, the merging mode, and the transmission criteria.

(3) Using public datasets, the balanced data partitioning strategy and the transmission optimization method are evaluated on a Spark cluster and on the Actor model, respectively. With three datasets and three big data algorithms, the effects of the parameters of the data partitioning strategy and of the transmission optimization method on the overall performance of MapReduce are evaluated, and the two methods are then compared with other methods. When the dataset itself is highly skewed, as with the BST dataset, the balanced data partitioning strategy improves overall performance by 19.7% on average over the default partitioning strategy when running the PageRank algorithm. When the Shuffle volume is large, as with the Inverted Index algorithm, the data transmission pipeline optimization method improves overall performance by 45.9% on average over the default transmission framework when processing the Konect dataset.
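The iterative, greedy allocation of micro-partitions described in item (1) can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: the function names `allocate_round` and `skew`, the largest-first ordering, the skew measure, and the sample sizes are all assumptions made for illustration. It also assumes that each round's micro-partitions may be assigned independently of earlier rounds.

```python
import heapq
from typing import Dict, List


def allocate_round(sizes: Dict[int, int], loads: List[int]) -> Dict[int, int]:
    """Greedily assign one round's micro-partitions to reducers.

    sizes: micro-partition id -> record count observed in this round.
    loads: cumulative load per reducer from earlier rounds (updated in place).
    Returns a mapping micro-partition id -> reducer index.
    """
    # Min-heap of (current load, reducer index): least-loaded reducer on top.
    heap = [(load, r) for r, load in enumerate(loads)]
    heapq.heapify(heap)
    assignment = {}
    # Assumption: place the largest micro-partitions first.
    for mp, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, r = heapq.heappop(heap)
        assignment[mp] = r
        loads[r] = load + size
        heapq.heappush(heap, (loads[r], r))
    return assignment


def skew(loads: List[int]) -> float:
    """Maximum load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


# Hypothetical per-round micro-partition sizes; partition 0 is heavily skewed.
rounds = [
    {0: 90, 1: 10, 2: 10, 3: 10},
    {0: 80, 1: 20, 2: 10, 3: 10},
]
loads = [0, 0]  # two reducers, no load yet
for i, sizes in enumerate(rounds, start=1):
    plan = allocate_round(sizes, loads)
    print(f"round {i}: plan={plan} loads={loads} skew={skew(loads):.2f}")
```

With this sample input, the first round leaves the two reducers at loads 90 and 30, and the second round sends the large micro-partition to the lighter reducer, ending at 120 and 120. This mirrors the idea that later rounds correct the skew left by earlier ones.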
Keywords/Search Tags: big data, MapReduce, data partitioning, transmission optimization