Automatic Hive Parameter Optimization Based On Workflow Similarity

Posted on: 2019-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
GTID: 2428330566996851
Subject: Computer technology
Abstract/Summary:
In the era of big data, enterprises pay increasing attention to analyzing massive data sets and extracting valuable information from them. As large-scale data analysis technology has matured, the MapReduce model has been proposed and widely adopted, giving researchers and data analysts an effective framework for analyzing big data. For users who are not proficient in programming, Hive, a data warehouse that runs MapReduce tasks from SQL statements, further lowers the barrier to using MapReduce. To achieve good execution efficiency, the running parameters of a MapReduce task must be set properly, and Hive likewise exposes many tunable parameters that strongly affect performance. In practice these parameters are set empirically, which adds extra work for data analysts. Making this process transparent, so that users can focus on the task itself, would significantly increase productivity.

This thesis proposes an approach based on workflow similarity that automatically sets optimal Hive task parameters, avoiding the performance problems caused by empirical settings and improving cluster efficiency. In the first part, a task is abstracted from its Hive execution plan into a multi-branch tree composed of basic operations. In the second part, edit distance is used to compute the structural similarity between tasks, and metadata related to the operations is used to compute the similarity of their data sizes. Tasks are clustered according to these similarities, and a regression model relating parameters to execution time is built for each cluster. In the third part, a limited parameter space is searched for the parameter combination that minimizes running time. Clustering tasks by similarity makes the optimization method more practical and accelerates the accumulation of tasks, so the point at which models can be built and optimized parameters searched for is reached sooner.

Experiments show that the similarity measure successfully quantifies the degree of similarity between different tasks, and that the best regression model can be selected through comparative testing and cross-validation. A global optimization algorithm was then used to find the optimal parameter settings for a given type of task. Using these settings yielded a performance improvement of between 5% and 15%.
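To illustrate the structural-similarity step, the following is a minimal Python sketch and not the thesis's implementation. It assumes an execution plan has already been parsed into a tree of operator labels, and it uses a simplified top-down (Selkow-style) tree edit distance with unit costs, which may differ from the edit-distance variant used in the thesis. The names `Op`, `tree_distance`, and `structural_similarity`, the operator labels, and the normalization by the combined tree size are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One node of a plan tree: an operator label plus child operators."""
    label: str
    children: tuple = ()


def size(tree: Op) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree.children)


def tree_distance(a: Op, b: Op) -> int:
    """Top-down tree edit distance with unit costs (assumed cost model).

    Relabeling a node costs 1; inserting or deleting a whole subtree
    costs its node count. Child lists are aligned with the usual
    sequence edit-distance dynamic program.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


def structural_similarity(a: Op, b: Op) -> float:
    """Map the distance to (0, 1]; 1.0 means identical trees.

    Dividing by the combined size is an assumption of this sketch; it
    keeps the score positive because the distance is always smaller
    than size(a) + size(b).
    """
    return 1.0 - tree_distance(a, b) / (size(a) + size(b))


# Two hypothetical plans: scan -> filter -> select -> sink, and the
# same plan without the filter. Labels are illustrative only.
plan_a = Op("FS", (Op("SEL", (Op("FIL", (Op("TS"),)),)),))
plan_b = Op("FS", (Op("SEL", (Op("TS"),)),))

if __name__ == "__main__":
    print(tree_distance(plan_a, plan_b))
    print(structural_similarity(plan_a, plan_b))
```

A score of this kind could be combined with a data-size similarity to form the pairwise measure on which tasks are clustered. The top-down restriction keeps the code short but can overestimate the distance when an operator is removed from the middle of a chain, so a general tree edit distance may be preferable in practice.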
Keywords/Search Tags: Hive, Parameter Optimization, Workflow, Task Similarity