
Research On Execution Optimization For Data Intensive Scientific Workflow In Multiple Data Centers Environment

Posted on: 2016-07-02
Degree: Master
Type: Thesis
Country: China
Candidate: M J Wang
Full Text: PDF
GTID: 2308330503978053
Subject: Computer technology
Abstract/Summary:
With the advent of the big-data era, a growing number of scientific experiments face difficulty in handling large amounts of data. Their data processing consists of multiple steps with dependencies among them, and a data-intensive scientific workflow is used to model such a process. These workflows typically involve very large-scale data processing that requires many resources, and the data they access may be distributed across the data centers of different research groups, so executing data-intensive scientific workflows across multiple data centers is increasingly common. For example, the AMS data processing and analysis procedure is a typical data-intensive scientific workflow application: to cope with its massive data processing demands, AMS data must be distributed to multiple data centers located in different parts of the world. However, executing data-intensive scientific workflows in this model raises many problems. On one hand, the workflow processes large amounts of data that are distributed across multiple data centers; on the other hand, a task in the workflow may take data from different data centers as input, so network transmission is needed to obtain that data. Because bandwidth between data centers is limited, massive data movement across data centers is becoming a key factor limiting the execution efficiency of data-intensive scientific workflows.

Efficient data management and task scheduling algorithms are the key to optimizing the execution of data-intensive scientific workflows in a multiple data centers environment. For the problem of massive initial data transmission, existing work is based on data correlation alone and does not consider data that is large in size but weakly correlated, so it cannot effectively reduce initial data transmission. For the problem of intermediate data transmission, existing studies use simple task replication or multiple data copies, which may lead to low efficiency. Much work in the literature thus neglects data size and deep task replication during workflow execution; new efficient scheduling algorithms and strategies, grounded in the characteristics of data-intensive scientific workflow execution, are therefore urgently needed. This thesis carries out research on the following four aspects. First, for initial data placement, the initial data is clustered with consideration of both data correlation and data size to reduce initial data transfer. Second, for intermediate data transmission, a multiple-level task replication strategy is proposed to achieve data locality and reduce intermediate data movement. Third, for data sets that must be acquired through data movement, a data pre-staging strategy is proposed. Finally, MDC-SWMS, a workflow scheduling system for multiple data centers, is designed, implemented, and deployed in a real cloud data center at Southeast University, incorporating the theoretical results of this thesis. Based on data from the AMS experiment, all functional modules of the system are tested to validate the effectiveness of the scheduling system and the efficiency of the theoretical results.

This thesis explores in depth the execution optimization of data-intensive scientific workflows in a multiple data centers environment. Results from the corresponding simulations and from experiments in a real cloud data center show that the proposed strategies and scheduling algorithms can effectively reduce data transmission between data centers and improve scientific workflow execution efficiency.
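
The point about data that is large but weakly correlated can be illustrated with a toy calculation. The following is only a minimal sketch and not the algorithm from the thesis: the function names, data set names, data center names, and sizes are all hypothetical. It compares choosing a data center by how many of a task's inputs it already holds against choosing it by how much data would have to be moved.

```python
# Illustrative sketch only; all names and numbers are hypothetical
# and are not taken from the thesis or from MDC-SWMS.

def bytes_to_move(inputs, location, size_gb, target):
    """Total size (GB) of the inputs that are not already stored at `target`."""
    return sum(size_gb[d] for d in inputs if location[d] != target)


def best_center_by_count(inputs, location, centers):
    """Pick the data center that already holds the most inputs (size ignored)."""
    return max(sorted(centers),
               key=lambda c: sum(1 for d in inputs if location[d] == c))


def best_center_by_size(inputs, location, size_gb, centers):
    """Pick the data center that minimizes the volume of data to transfer."""
    return min(sorted(centers),
               key=lambda c: bytes_to_move(inputs, location, size_gb, c))


# Two small data sets sit in dc_a; one very large data set sits in dc_b.
size_gb = {"d1": 1, "d2": 1, "d3": 500}
location = {"d1": "dc_a", "d2": "dc_a", "d3": "dc_b"}
centers = ["dc_a", "dc_b"]
inputs = ["d1", "d2", "d3"]  # inputs of one hypothetical task

print(best_center_by_count(inputs, location, centers))
print(best_center_by_size(inputs, location, size_gb, centers))
```

In this toy setup the count-based choice selects dc_a and forces the 500 GB data set across the network, while the size-aware choice selects dc_b and moves only 2 GB, which is the kind of gap that motivates taking data size into account alongside correlation.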
Keywords/Search Tags:Cloud Computing, Multiple Data Centers, Data-intensive Scientific Workflow, Multiple Tasks Replication, Data Pre-staging