
Research On Methods Of Performance Optimization And Energy Saving In Big Data Processing System

Posted on: 2020-10-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Chen
Full Text: PDF
GTID: 1368330578457471
Subject: Software engineering
Abstract/Summary:
In recent years, big data has influenced people's lifestyles. Because of the limitations of traditional data processing systems, researchers proposed MapReduce, a parallel computing model, and developed open source big data processing systems such as Hadoop and Spark to analyze and mine the potential value of big data. As the computing power of mobile terminals increases rapidly, the volume of data they process has become very large, and mobile big data processing systems represented by Mobile Edge Computing (MEC) have been widely used. While these new technologies make big data processing more convenient, they also bring new challenges and problems. Building on earlier research on big data processing, this dissertation focuses on performance optimization and energy saving in big data processing systems and carries out research in the following aspects:

(1) The MapReduce load balancing strategy MRSIM (Mitigate Reducer Skew In MapReduce) is proposed for data skew. It addresses the uneven load among the Reducer nodes of a Hadoop cluster that data skew causes, a problem that lowers the performance of MapReduce jobs. To reduce additional system overhead, the MRSIM policy embeds a load monitoring module on each DataNode. Exploiting the fact that Reducer nodes have sufficient computing resources during the shuffle stage, this module gathers load statistics while the Reducer nodes pull the intermediate results. When balancing the load across Reducer nodes, MRSIM fully considers the system overhead caused by data transmission.

(2) The Bayes classifier-based MapReduce computational model optimization strategy BAPM (naive BAyes classifier based Partitioner for Mapreduce) is proposed. This strategy considers both data locality and data skew in order to improve MapReduce job execution performance. For different types of MapReduce jobs under different bandwidth conditions, BAPM automatically determines whether data locality or data skew is the main factor affecting job speed. BAPM includes two optimization algorithms, LPS (data Locality Prior to data Skew) and SPL (data Skew Prior to data Locality), which consider data locality and data skew in opposite orders of priority. BAPM classifies MapReduce jobs based on the amount of intermediate result data generated in the map stage and the amount of input data to the map stage. It then uses the MapReduce job type and the available bandwidth between the DataNodes of the Hadoop cluster as the classification attributes, and uses a naive Bayes classifier to select automatically between the LPS and SPL algorithms, thereby improving the efficiency of the MapReduce job.

(3) The optimization strategy PIY (Partitioner In Yarn) for the MapReduce computing model is proposed. This strategy improves the execution performance of MapReduce jobs in heterogeneous Hadoop clusters by considering both data skew and data locality. In the PIY strategy, first, the proposed parallel reservoir sampling algorithm quickly and accurately obtains the distribution of the MapReduce job input data. Second, based on the subset sum problem, the proposed BASH algorithm takes the heterogeneity of the cluster into account and distributes the skewed intermediate result data to each Reducer node according to the computing power of that node. Finally, for DataNodes that run both Map and Reduce operations, BASH reduces the amount of data transferred on these nodes during the shuffle phase, thereby increasing the degree of data locality for Reduce jobs. To handle the long tail phenomenon of data in PIY, an approximate relaxed subset sum algorithm is proposed and integrated into the BASH algorithm to form a new algorithm, E-BASH (Extension of BASH). In addition, a method for determining the optimal sampling rate is proposed to ensure both the accuracy and the speed of the parallel reservoir sampling algorithm.

(4) The mobile terminal energy optimization algorithm APO (Adaptive Partial migration algorithm) is proposed. The algorithm targets an application scenario in which a mobile terminal can communicate with many heterogeneous edge server clusters (ESCs) and selects one of the servers for computation migration. Based on dynamic voltage adjustment technology, the mobile terminal performs computation migration by automatically adjusting the CPU frequency and the transmitting voltage, determining the appropriate partial offload ratio, and selecting an appropriate edge server. The goal is to minimize the energy consumption of the mobile terminal while satisfying the execution time constraint of the job. For this problem, we establish a mathematical model, use the parameter subtraction method to prove that the problem is a convex optimization problem, and then obtain the globally optimal solution through the APO algorithm.

The experimental results show that the proposed methods can effectively improve the performance of big data processing systems and reduce the energy consumption of equipment.
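
The abstract does not describe how the parallel reservoir sampling in PIY is implemented. As a rough illustration of the underlying idea only, the following minimal sketch shows classic single-stream reservoir sampling (Algorithm R) and how a sample can be turned into an estimate of the key distribution. The function names `reservoir_sample` and `estimate_key_shares` are hypothetical and are not taken from the dissertation.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng=None):
    """Classic Algorithm R: uniform random sample of up to k items
    from an iterable whose length is not known in advance.
    (Illustrative only; not the dissertation's parallel variant.)"""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_key_shares(keys, k, rng=None):
    """Estimate each key's share of the input from a reservoir sample.
    (Hypothetical helper for illustration.)"""
    sample = reservoir_sample(keys, k, rng)
    if not sample:
        return {}
    counts = Counter(sample)
    return {key: count / len(sample) for key, count in counts.items()}


if __name__ == "__main__":
    # A skewed toy input: key "a" dominates.
    keys = ["a"] * 900 + ["b"] * 80 + ["c"] * 20
    print(estimate_key_shares(keys, 100, random.Random(42)))
```

A partitioner could use such estimated shares to anticipate which keys will produce heavy Reducer partitions before the shuffle phase begins. How PIY parallelizes the sampling and chooses the sampling rate is not specified in the abstract.
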
Keywords/Search Tags: Big data processing, MapReduce, Hadoop, Mobile computing, Data skew, Data locality, Cluster heterogeneity, Energy saving