
Research On Key Technologies Of Performance Tuning Of Jobs In Distributed Data Processing System

Posted on: 2017-08-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Liu
Full Text: PDF
GTID: 1318330503482836
Subject: Computer Science and Technology
Abstract/Summary:
With the growth of large-scale data across many sectors, distributed data processing technology is widely used in data analysis. MapReduce has many outstanding merits, such as ease of use, simple programming, fault tolerance, and high cost-effectiveness, and has therefore become a widely used model for distributed parallel processing in large-scale data analysis. However, with the growing demand for data processing, some shortcomings of MapReduce have gradually been revealed, including complex configuration parameters, an imperfect task scheduler, complex data locality, and inefficient slot allocation. These shortcomings lead to low efficiency of MapReduce jobs. Performance tuning is an important method of optimizing MapReduce performance: it can reduce job execution time and complete jobs according to user requirements while using resources reasonably. Therefore, the study of performance tuning for MapReduce has important scientific significance and application value.

This thesis researches several key technologies in the performance tuning of MapReduce jobs. Based on a survey of related technologies for performance tuning of MapReduce jobs, an I/O cost function is introduced to show that configuration parameters strongly affect job running time. In addition, a feature selection algorithm is proposed to select the important configuration parameters for MapReduce jobs. Moreover, optimizations of data locality, the replica placement strategy, and the task scheduler are proposed to reduce job running time. The main contributions of this thesis are as follows:

(1) Objective functions for I/O read/write bytes and I/O request count are introduced to show that certain configuration parameters have great influence on job running time.
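As a hedged illustration of such an objective function (the parameter names and the spill/merge cost model below are hypothetical assumptions, not the thesis's actual formulation), map-side I/O bytes can be sketched as a function of a configuration parameter such as the sort buffer size:

```python
import math

# Hypothetical sketch of a map-side I/O cost objective; the parameter names
# and the spill/merge model are illustrative assumptions, not the thesis's
# actual cost function.
def io_cost(input_bytes, sort_buffer_mb, merge_factor=10):
    """Estimate map-side I/O bytes as a function of the sort buffer size."""
    buffer_bytes = sort_buffer_mb * 1024 * 1024
    # Each time the in-memory sort buffer fills, its contents spill to disk.
    spills = max(1, math.ceil(input_bytes / buffer_bytes))
    # Spill files are merged in rounds of `merge_factor` files at a time.
    merge_passes = math.ceil(math.log(spills, merge_factor)) if spills > 1 else 0
    # Output is written once, then read and rewritten once per merge pass.
    return input_bytes * (2 + 2 * merge_passes)

# A larger sort buffer means fewer spills and merge passes, hence less I/O.
small_buffer = io_cost(1 << 30, sort_buffer_mb=64)   # many spills
large_buffer = io_cost(1 << 30, sort_buffer_mb=512)  # few spills
```

Under this toy model, enlarging the sort buffer reduces spills and merge passes, which is exactly the kind of parameter sensitivity such objective functions are meant to expose.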
It is further verified that each configuration parameter influences job execution time to a different degree.

(2) This thesis proposes a feature selection algorithm based on a kernel-penalized clustering algorithm (IK-means), addressing the difficulty administrators face in setting the many configuration parameters of MapReduce. In the algorithm, the isotropic Gaussian kernel is replaced by an anisotropic Gaussian kernel in order to accurately determine the influence of each feature: the anisotropic Gaussian kernel measures the importance of each feature through a separate parameter in each direction (a kernel width per dimension). To stay close to the original clustering effect, a gradient descent algorithm is introduced to select the set of features closest to the original kernel widths. To address the sensitivity of the clustering algorithm to its initial centers, this thesis also proposes a globally aware, local-density-based initial point selection algorithm. Theoretical and experimental results show that the proposed algorithm achieves good results in selecting configuration parameters for MapReduce.

(3) This thesis proposes an algorithm based on minimum-weight bipartite matching to address the data locality problem of multiple tasks at the same time. In addition, a self-adaptive replica placement algorithm based on hot blocks is proposed; identifying hot data solves the problem of determining which copies to back up in dynamic replica placement. Theoretical and experimental results show that the self-adaptive replica placement algorithm effectively supports the minimum-weight bipartite matching algorithm, which improves the data locality of multiple tasks.

(4) This thesis proposes an improved task scheduling algorithm that satisfies users' time demands while minimizing resource consumption.
In the algorithm, a job profile is introduced to estimate the execution time of new jobs from the time and resource consumption information of completed jobs. The algorithm not only meets users' time demands but also avoids excessive resource consumption during job execution. Its effectiveness is verified by theoretical analysis of the job execution process, and experimental results confirm its advantage in job execution time and slot resource consumption.
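A minimal sketch of this profile-based estimation (the profile field names and the linear scaling model are assumptions for illustration, since the abstract does not specify the actual job profile format):

```python
# Minimal sketch of job-profile-based runtime estimation; the profile fields
# and the linear scaling model are assumptions for illustration, not the
# thesis's actual job profile format.
def estimate_runtime(profile, new_input_records):
    """Predict a new job's running time from a completed job's profile."""
    records_per_second = profile["input_records"] / profile["runtime_seconds"]
    return new_input_records / records_per_second

# Statistics gathered from a completed run of the same kind of job.
profile = {"input_records": 1_000_000, "runtime_seconds": 200.0}

# The scheduler can then grant slots only when the predicted time meets the
# user's deadline, avoiding resources spent on jobs that cannot finish in time.
predicted = estimate_runtime(profile, 2_500_000)
meets_deadline = predicted <= 600.0
```

A real profile would also track per-phase (map, shuffle, reduce) times and slot usage, but even this linear model captures the idea of predicting new jobs from completed ones.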
Keywords/Search Tags: MapReduce, kernel methods, clustering, task scheduler, replica placement strategy