
Research On Optimization Technology Of Data Parallel Processing Based On MapReduce

Posted on: 2019-06-19    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z Wang    Full Text: PDF
GTID: 1368330623953332    Subject: Computer Science and Technology
Abstract/Summary:
With the development of Internet technology, cloud computing, and smart devices, the volume of data has been growing by roughly 40% per year and was projected to reach 35 ZB (zettabytes) by 2020 — a sign that we have entered the "Big Data Era". In this era, computing models have shifted from traditional stand-alone models to parallel processing models. The parallel programming model is an important research topic in parallel processing, and MapReduce is one of the most representative parallel computing paradigms, renowned for its massive parallel processing power, straightforward programming model, dynamic load-balancing mechanism, scalability, fault tolerance, ease of use, and cost-effectiveness. MapReduce meets the basic requirements of most data-intensive applications. However, challenges emerge in real applications: MapReduce leaves substantial room for optimization, and performance in particular is a major factor constraining its development and adoption. MapReduce still has deficiencies, including the lack of adaptive data-balancing ability and the lack of efficient processing for computation-intensive applications. Hence, this thesis proposes a set of optimization strategies and technologies for MapReduce. The main contributions can be summarized in the following four points:

1) To mitigate data skew in MapReduce, we propose a novel multiple-round approach to balance the data load in the Reduce phase. First, each Mapper produces more fine-grained partitions than the number of Reducers and gathers statistics on the sizes of these fine-grained partitions. The JobTracker then selects appropriate fine-grained partitions to allocate to Reducers; we introduce a cost model and propose a heuristic assignment algorithm for this task. Because the allocated data may still become skewed as processing proceeds, we apply multiple allocation rounds to build balanced partitions, and each Reducer only needs to merge its own partitions before executing the reduce function. Finally, we experimentally compare our approach with Closer, which uses a segment-partition method, on both Zipf and real datasets. The results show that our method achieves a more balanced data load.

2) The incremental allocation strategy of the first contribution cannot automatically select a proper number of partitions to allocate in each round, so we propose a Markov Decision Process (MDP) model to optimize multiple-round micro-partition scheduling. The incremental partitioning approach in contribution 1 selects partition candidates under an equalization principle — the number of partitions allocated per round is fixed — which only handles centralized data distributions. We therefore develop an adaptive incremental partition optimization technique that expresses the partition-allocation process as an MDP. First, we build a cost model to evaluate the cost of allocating each partition; then, at each decision point, we select the set of partitions whose assignment yields the maximum benefit; after multiple decision rounds, all micro-partitions have been allocated to Reducers. We again compare our approach experimentally with Closer on both Zipf and real datasets, and the results show that our method achieves higher performance.

3) Under MapReduce, the parallel maximal clique enumeration algorithm based on multi-path partitioning is inefficient. We propose a new approach for maximal clique enumeration that identifies dense subgraphs by binary graph partitioning: the graph is iteratively partitioned into only two parts, and maximal cliques are enumerated during the partitioning process, which improves both search efficiency and load balance. We implement this method in both stand-alone and parallel environments; the stand-alone algorithm focuses on high efficiency. Our parallel algorithms are primarily implemented and evaluated on MapReduce, a popular shared-nothing parallel framework, but generalize easily to other shared-nothing or shared-memory parallel frameworks. Finally, experiments on both randomly generated and real data show that the parallel binary-partitioning algorithm is more efficient than the traditional method.

4) Building on the third contribution, we address the inefficiency on MapReduce of the parallel maximal k-plex enumeration algorithm based on multi-path partitioning. We propose a new approach for maximal k-plex enumeration that likewise identifies dense subgraphs by binary graph partitioning: the graph is iteratively partitioned into only two parts, and maximal k-plexes are enumerated during the partitioning process, improving both search efficiency and load balance. We implement this method in both stand-alone and parallel environments; the stand-alone algorithm focuses on high efficiency, and the parallel algorithms are primarily implemented and evaluated on MapReduce but generalize easily to other shared-nothing or shared-memory parallel frameworks. Finally, experiments on both randomly generated and real data show that the parallel binary-partitioning algorithm achieves higher efficiency and a smaller search space.
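The heuristic assignment task in contribution 1 — allocating fine-grained partitions to Reducers so that loads stay balanced — can be illustrated with a classic greedy least-loaded rule. The sketch below is a minimal illustration under a simplified cost model (partition size only), not the thesis's actual cost model or algorithm; the function name and interface are assumptions.

```python
import heapq

def assign_partitions(partition_sizes, num_reducers):
    """Greedy heuristic sketch: repeatedly give the largest remaining
    fine-grained partition to the currently least-loaded reducer.
    Returns one list of partition indices per reducer."""
    # Min-heap of (current_load, reducer_id), so the least-loaded
    # reducer is always at the top.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_reducers)]
    # Assign largest partitions first (longest-processing-time rule).
    order = sorted(range(len(partition_sizes)),
                   key=lambda i: partition_sizes[i], reverse=True)
    for i in order:
        load, r = heapq.heappop(heap)
        assignment[r].append(i)
        heapq.heappush(heap, (load + partition_sizes[i], r))
    return assignment
```

In a real MapReduce deployment the JobTracker would run such a rule per allocation round, using measured partition statistics rather than the static sizes assumed here.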
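The binary-partitioning idea of contributions 3 and 4 — splitting the search on one vertex into "results containing v" versus "results excluding v" — can be illustrated for maximal cliques with a standard Bron–Kerbosch-style binary branching. This is a stand-alone serial sketch of the general technique, not the thesis's parallel MapReduce implementation, and all names here are illustrative.

```python
def _enumerate(R, cand, excl, adj, out):
    # R: current clique; cand: vertices that can still extend R;
    # excl: vertices already handled by an earlier branch.
    if not cand and not excl:
        out.append(sorted(R))  # R is maximal: nothing can extend it
        return
    if not cand:
        return  # extendable by a vertex in excl, so not maximal
    v = next(iter(cand))
    # Branch 1: cliques containing v -> restrict to v's neighbors.
    _enumerate(R | {v}, cand & adj[v], excl & adj[v], adj, out)
    # Branch 2: cliques excluding v -> move v to the excluded set.
    _enumerate(R, cand - {v}, excl | {v}, adj, out)

def maximal_cliques(adj):
    """Enumerate all maximal cliques of a graph given as an
    adjacency dict {vertex: set_of_neighbors}."""
    out = []
    _enumerate(set(), set(adj), set(), adj, out)
    return out
```

Each branching step partitions the remaining search space into exactly two parts, which is what makes the scheme attractive for parallel execution: the two subproblems are independent and can be shipped to different workers.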
Keywords/Search Tags: Parallel optimization techniques, MapReduce framework, data partitioning, incremental allocation, Markov Decision Process, maximal clique enumeration, maximal k-plex enumeration, parallel graph processing, MPI, Hadoop