
Research And Implementation Of Scientific Big Data Application Execution Optimization Mechanism In Multiple Data Center Environments

Posted on: 2019-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2428330596460919
Subject: Computer technology
Abstract/Summary:
In recent years, with the rise of technologies such as cloud computing and big data, experiments in data-centric scientific fields such as high-energy physics, astrophysics, and bioinformatics have grown in scale. The amount of scientific data generated and accumulated has grown significantly, and its analysis has become more complex and in-depth. These are typical scientific big data applications. Such applications can often be modeled as scientific workflows, and they need large-scale computing and storage resources to execute. To execute scientific workflows efficiently, multiple research institutions need to aggregate their data center resources to support massive data storage and large-scale workflow task scheduling. However, because network bandwidth among data centers is relatively limited, the large volume of cross-data-center transfers during distributed workflow execution can easily become a performance bottleneck. Data placement and task scheduling largely determine how much data crosses data centers: a rational data layout and efficient scheduling of workflow tasks reduce that volume and are the key to improving workflow execution efficiency. Current research on data layout and workflow scheduling does not fully consider the execution characteristics of scientific big data applications, such as linked data access, fixed initial input data, and massive intermediate data storage, so it cannot achieve a reasonable data layout or optimized task scheduling. To optimize the execution of scientific big data applications in a multi-data-center environment, this master's thesis addresses the following three aspects.

First, we study the placement of massive scientific data. Executing scientific workflows at each data center requires a large amount of initial data as input, so the relevant initial data is frequently requested by every data center. To reduce the cost of accessing this data, the thesis considers the locality of data placement and introduces data access pattern features and data center storage constraints to model data placement as an integer programming problem. We also design and implement an efficient heuristic algorithm based on Lagrangian relaxation to solve it.

Second, we study the task scheduling of scientific workflows. To reduce data communication across data centers during workflow execution, and building on the data placement above, the thesis takes into account the complex dependencies of scientific workflows, the placement of initial input data, the placement of intermediate data, and the computing and storage limits of each data center. On this basis we formulate the workflow task scheduling model, propose a novel heuristic based on a multilevel coarsening and uncoarsening graph partitioning framework, and adopt a specialized hybrid genetic algorithm to solve the problem efficiently.

Finally, we implement and deploy a workflow management system for scientific big data applications. To further verify the proposed optimization strategy, we implement the scientific data layout and workflow task scheduling methods on top of an existing workflow management system and deploy it in data center environments including the cloud computing center of Southeast University and the Shuguang Computing Center.

In summary, this thesis studies the execution optimization mechanism of scientific big data applications in a multi-data-center environment and proposes an access pattern-aware data placement method and a data placement-sensitive task scheduling method. Extensive simulation experiments and experiments in a real multi-data-center environment show that the proposed strategy effectively reduces data transmission across data centers during scientific workflow execution and optimizes the execution of scientific big data applications.
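
To make the data placement step more concrete, the following is a minimal sketch of a Lagrangian-relaxation heuristic for a capacity-constrained placement problem. It is an illustration under assumptions, not the algorithm from the thesis: the abstract does not give the actual model, so the cost matrix, the single-copy placement rule, the greedy repair step, and all names (`place_data`, `repair`, `access_cost`) are hypothetical. Each dataset goes to exactly one data center, the storage constraints are relaxed with multipliers, and a subgradient step raises the multiplier of every overloaded data center.

```python
def access_cost(placement, cost):
    """Total access cost of a placement (placement[i] = data center of dataset i)."""
    return sum(cost[i][j] for i, j in enumerate(placement))


def repair(sizes, capacity, adjusted):
    """Greedy repair (an assumed step): place large datasets first, each in the
    cheapest data center, by adjusted cost, that still has room."""
    free = list(capacity)
    placement = [None] * len(sizes)
    for i in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        candidates = [j for j in range(len(free)) if free[j] >= sizes[i]]
        if not candidates:
            return None  # no feasible placement found by this greedy pass
        j = min(candidates, key=lambda j: adjusted[i][j])
        placement[i] = j
        free[j] -= sizes[i]
    return placement


def place_data(cost, sizes, capacity, iterations=50, step=1.0):
    """Hypothetical Lagrangian-relaxation heuristic.

    cost[i][j]  : assumed cost of serving dataset i from data center j
    sizes[i]    : size of dataset i
    capacity[j] : storage capacity of data center j
    """
    n, m = len(sizes), len(capacity)
    lam = [0.0] * m  # one multiplier per relaxed storage constraint
    best, best_cost = None, float("inf")
    for k in range(iterations):
        # Costs with the storage-constraint penalty folded in.
        adjusted = [[cost[i][j] + lam[j] * sizes[i] for j in range(m)]
                    for i in range(n)]
        # Relaxed subproblem: every dataset independently picks its cheapest center.
        relaxed = [min(range(m), key=lambda j: adjusted[i][j]) for i in range(n)]
        # Turn the multiplier information into a feasible placement.
        candidate = repair(sizes, capacity, adjusted)
        if candidate is not None and access_cost(candidate, cost) < best_cost:
            best, best_cost = candidate, access_cost(candidate, cost)
        # Subgradient step: penalize data centers the relaxed solution overloads.
        load = [0.0] * m
        for i, j in enumerate(relaxed):
            load[j] += sizes[i]
        t = step / (k + 1)
        lam = [max(0.0, lam[j] + t * (load[j] - capacity[j])) for j in range(m)]
    return best, best_cost


if __name__ == "__main__":
    # Toy instance: two large datasets both prefer data center 0, which fits only one.
    cost = [[1, 5], [1, 5], [4, 2]]
    print(place_data(cost, sizes=[2, 2, 1], capacity=[2, 3]))
```

The relaxed subproblem decomposes per dataset, which is what makes this family of heuristics cheap per iteration. A real model would also need to reflect the access pattern features mentioned above.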
Keywords/Search Tags:Scientific Big Data Applications, Multiple Data Centers, Data Placement, Workflow Scheduling, Workflow Management System