Automatic Hive Parameter Optimization Based On Workflow Similarity

Posted on: 2019-06-22

Degree: Master

Type: Thesis

Country: China

Candidate: Y Liu

GTID: 2428330566996851

Subject: Computer technology

Abstract/Summary:

In the era of big data, analyzing massive datasets and extracting valuable information from them has attracted growing attention from enterprises. As large-scale data analysis technology has matured, the MapReduce model has been proposed and widely adopted, giving researchers and data analysts an effective framework for analyzing big data. For users who are not proficient programmers, Hive, a data warehouse that runs MapReduce tasks from SQL statements, further lowers the barrier to using MapReduce. To achieve good execution efficiency, the runtime parameters of MapReduce tasks must be set properly, and Hive likewise exposes many tunable parameters that strongly affect performance. In practice these parameters are set empirically, which adds extra work for data analysts. Making this process transparent, so that users can focus on the task itself, would significantly increase productivity.

This thesis proposes an approach based on workflow similarity that automatically sets optimized Hive task parameters, addressing the performance problems caused by empirical parameter setting and improving cluster efficiency. In the first part, a task is abstracted from Hive's execution plan and translated into a multi-branch tree composed of basic operations. In the second part, edit distance is used to compute the structural similarity between tasks, and metadata related to the operations is used to compute the similarity of their data sizes. Tasks are clustered by similarity, and a regression model relating parameters to execution time is built for each cluster. In the third part, a limited parameter space is searched for the parameter combination that minimizes running time. Clustering tasks by similarity makes the optimization method more practical and accelerates the accumulation of tasks, so that the point at which models can be built and optimized parameters searched for is reached sooner.

In experiments, the similarity measure successfully quantified the degree of similarity between different tasks, and the best regression model was selected through comparison testing and cross-validation. A global optimization algorithm was then used to find the optimal parameter settings for a given type of task. Using these settings yielded a performance improvement of between 5% and 15%.
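The structural-similarity step can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say which edit-distance variant is used, so this example assumes a simplification in which each operator tree is flattened in pre-order and compared with the classic Levenshtein distance. The `OpNode` class, the operator names, and the normalization by the longer sequence are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OpNode:
    """One operator in a simplified, hypothetical execution-plan tree."""
    op: str
    children: list["OpNode"] = field(default_factory=list)


def preorder(node: OpNode) -> list[str]:
    """Flatten the operator tree into a pre-order list of operator names."""
    ops = [node.op]
    for child in node.children:
        ops.extend(preorder(child))
    return ops


def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two sequences (unit-cost edits)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def structural_similarity(t1: OpNode, t2: OpNode) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical operator sequences.

    Normalizing by the longer sequence is an assumption of this sketch.
    """
    a, b = preorder(t1), preorder(t2)
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


# Two illustrative plans: a filtered scan and a two-table join.
filter_query = OpNode("FileSink", [OpNode("Select", [
    OpNode("Filter", [OpNode("TableScan")])])])
join_query = OpNode("FileSink", [OpNode("Select", [
    OpNode("Join", [OpNode("TableScan"), OpNode("TableScan")])])])

print(structural_similarity(filter_query, join_query))
```

A score like this could serve as one input to the clustering step, alongside a separate data-size similarity derived from table metadata. A true tree edit distance would respect parent-child structure more faithfully than the flattened sequences used here.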

Keywords/Search Tags:

Hive, Parameter Optimization, Workflow, Task Similarity

Related items

1	Method And Implementation For Hive-Based Offline Data Processing
2	Detecting Duplicate Workflow Tasks And Noise Logs To Support Process Modeling
3	The Research And Practice Of Performance Optimization Based On Hive
4	Research On Workflow Engine Supporting Dynamic Task Assignment
5	Research On Particle Swarm Optimization Based Task Scheduling For Cloud Workflow System
6	Research On Workflow Engine Based On Dynamic Demand
7	Workflow System, Task Scheduling Strategy
8	Task Scheduling And Virtual Machine Integration Of Data Intensive Batch Processing Workflow
9	Research On Hive Query Optimization Based On Parquet Format
10	Multi-Objective Optimization For Workflow Task Scheduling In Mobile Cloud Computing