
Research On Key Technologies Of Performance Tuning Of Jobs In Distributed Data Processing System

Posted on: 2017-08-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Liu
Full Text: PDF
GTID: 1318330503482836
Subject: Computer Science and Technology
Abstract/Summary:
With the growth of large-scale data across many sectors, distributed data processing technology is widely used in data analysis. MapReduce has many outstanding merits, such as ease of use, simple programming, fault tolerance, and high cost-effectiveness, and has therefore become a widely used model for distributed parallel processing in large-scale data analysis. However, with the growing demand for data processing, some shortcomings of MapReduce have gradually been revealed, including complex configuration parameters, an imperfect task scheduler, complex data locality, and inefficient slot allocation. These shortcomings lead to low efficiency of MapReduce jobs. Performance tuning is an important method of optimizing MapReduce performance: it can reduce job execution time and complete jobs according to user requirements while using resources reasonably. Therefore, the study of performance tuning for MapReduce has important scientific significance and application value.

This thesis researches several key technologies in the performance tuning of MapReduce jobs. Based on a survey of related technologies for performance tuning of MapReduce jobs, an I/O cost function is introduced to show that configuration parameters strongly affect job running time. In addition, a feature selection algorithm is proposed to select the important configuration parameters for MapReduce jobs. Moreover, optimizations of data locality, the replica placement strategy, and the task scheduler are proposed to reduce job running time. The main contributions of this thesis are as follows:

(1) Objective functions for I/O read/write bytes and I/O request count are introduced to show that certain configuration parameters have great influence on job running time.
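As a hedged illustration of such an objective function (the parameter names and the spill/merge cost model below are hypothetical assumptions, not the thesis's actual formulation), map-side I/O bytes can be sketched as a function of a configuration parameter such as the sort buffer size:

```python
import math

# Hypothetical sketch of a map-side I/O cost objective; the parameter names
# and the spill/merge model are illustrative assumptions, not the thesis's
# actual cost function.
def io_cost(input_bytes, sort_buffer_mb, merge_factor=10):
    """Estimate map-side I/O bytes as a function of the sort buffer size."""
    buffer_bytes = sort_buffer_mb * 1024 * 1024
    # Each time the in-memory sort buffer fills, its contents spill to disk.
    spills = max(1, math.ceil(input_bytes / buffer_bytes))
    # Spill files are merged in rounds of `merge_factor` files at a time.
    merge_passes = math.ceil(math.log(spills, merge_factor)) if spills > 1 else 0
    # Output is written once, then read and rewritten once per merge pass.
    return input_bytes * (2 + 2 * merge_passes)

# A larger sort buffer means fewer spills and merge passes, hence less I/O.
small_buffer = io_cost(1 << 30, sort_buffer_mb=64)   # many spills
large_buffer = io_cost(1 << 30, sort_buffer_mb=512)  # few spills
```

Under this toy model, enlarging the sort buffer reduces spills and merge passes, which is exactly the kind of parameter sensitivity such objective functions are meant to expose.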
It is further verified that each configuration parameter influences job execution time to a different degree.

(2) This thesis proposes a feature selection algorithm based on a kernel-penalized clustering algorithm (IK-means), addressing the difficulty administrators face in setting the many configuration parameters of MapReduce. In the algorithm, the isotropic Gaussian kernel is replaced by an anisotropic Gaussian kernel in order to accurately determine the influence of each feature: the anisotropic Gaussian kernel measures the importance of each feature through a separate parameter in each direction (a kernel width per dimension). To stay close to the original clustering effect, a gradient descent algorithm is introduced to select the set of features closest to the original kernel widths. To address the sensitivity of the clustering algorithm to its initial centers, this thesis also proposes a globally aware, local-density-based initial point selection algorithm. Theoretical and experimental results show that the proposed algorithm achieves good results in selecting configuration parameters for MapReduce.

(3) This thesis proposes an algorithm based on minimum-weight bipartite matching to address the data locality problem of multiple tasks at the same time. In addition, a self-adaptive replica placement algorithm based on hot blocks is proposed; identifying hot data solves the problem of determining which copies to back up in dynamic replica placement. Theoretical and experimental results show that the self-adaptive replica placement algorithm effectively supports the minimum-weight bipartite matching algorithm, which improves the data locality of multiple tasks.

(4) This thesis proposes an improved task scheduling algorithm that satisfies users' time demands while minimizing resource consumption.
In the algorithm, a job profile is introduced to estimate the execution time of new jobs from the time and resource consumption information of completed jobs. The algorithm not only meets users' time demands but also avoids excessive resource consumption during job execution. Its effectiveness is verified by theoretical analysis of the job execution process, and experimental results confirm its advantage in job execution time and slot resource consumption.
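A minimal sketch of this profile-based estimation (the profile field names and the linear scaling model are assumptions for illustration, since the abstract does not specify the actual job profile format):

```python
# Minimal sketch of job-profile-based runtime estimation; the profile fields
# and the linear scaling model are assumptions for illustration, not the
# thesis's actual job profile format.
def estimate_runtime(profile, new_input_records):
    """Predict a new job's running time from a completed job's profile."""
    records_per_second = profile["input_records"] / profile["runtime_seconds"]
    return new_input_records / records_per_second

# Statistics gathered from a completed run of the same kind of job.
profile = {"input_records": 1_000_000, "runtime_seconds": 200.0}

# The scheduler can then grant slots only when the predicted time meets the
# user's deadline, avoiding resources spent on jobs that cannot finish in time.
predicted = estimate_runtime(profile, 2_500_000)
meets_deadline = predicted <= 600.0
```

A real profile would also track per-phase (map, shuffle, reduce) times and slot usage, but even this linear model captures the idea of predicting new jobs from completed ones.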
Keywords/Search Tags: MapReduce, kernel methods, clustering, task scheduler, replica placement strategy