
Data Placement Strategy Research For Scientific Workflow In Hybrid Cloud Computing

Posted on: 2015-01-25    Degree: Master    Type: Thesis
Country: China    Candidate: F Ma    Full Text: PDF
GTID: 2268330428464091    Subject: Computer software and theory
Abstract/Summary:
With the development and popularization of the Internet and data storage technology, data-intensive applications have become widespread in weather forecasting, astrophysics, bioinformatics and other scientific computing disciplines. Such applications process enormous amounts of data, often at the TB or PB level, and the application data are typically interrelated. Workflow technology allows a variety of tasks with complex computational characteristics to execute automatically; a data-intensive application built on workflow technology is called a scientific workflow. Cloud computing, a typical distributed network computing technology, can provide storage and computing resources to scientific workflow applications using relatively cheap hardware and software facilities, offering a new low-cost deployment and execution scheme. A scientific workflow application deployed in a cloud computing environment can save substantial implementation costs and also gives researchers around the world a resource-sharing and cooperation platform over the Internet.

While a cloud computing system dynamically provides high-performance computing resources and massive storage space for scientific workflow execution, it also poses serious challenges to users' privacy protection and the safety of information assets. A Gartner survey of cloud computing in 2012 showed that more than 70% of the surveyed enterprises did not plan to adopt cloud computing in the near term; their CTOs cited data security and privacy protection as the main concerns. In addition, data has "weight": once users' data reside on remote servers operated by a cloud service provider, they become extremely heavy and difficult to migrate, and data migration costs far more than data storage.

To meet enterprises' security and migration needs, cloud computing has evolved into public cloud, private cloud and hybrid cloud computing. Public cloud computing offers better scalability and flexibility and suits open applications; private cloud computing is safer and easier to control and suits critical and sensitive data. Hybrid cloud computing is a newer architecture that mixes public and private clouds, combining scalability with security. According to application requirements and cost constraints, enterprises can flexibly choose between public and private clouds and build a computing and resource center with high availability, dynamic extensibility and high security, which constitutes the hybrid cloud computing model.

Public cloud computing provides IT-related resources to users as pay-per-use services. The execution of a scientific workflow application deployed in a hybrid cloud environment is a cooperation between the public cloud and the private cloud, so data movement across data centers is difficult to avoid during execution. This brings two problems: (1) data movement across data centers may incur long transmission times; (2) it may also incur high transmission costs.
This paper proposes two data placement strategies, one for each of these problems.

For the transmission time problem, traditional data placement methods use a load balancing model to partition the data dependency matrix and place the datasets, but they do not account for the transmission time incurred by balancing the load. We put forward a new classification model based on the degree of data dependency damage, and on top of it a time-effective data placement method (sketched below) consisting of two algorithms: a static placement algorithm for the initial stage and a dynamic placement algorithm for the running stage. Experiments show that the method effectively reduces cross-datacenter data transmission time during scientific workflow execution.

For the transmission cost problem, traditional approaches reduce costs only for individual workflows. A workflow system, however, usually consists of multiple workflows that share datasets with one another, so approaches that consider only a single workflow are not necessarily effective for a multi-workflow system. In this paper, with the help of Particle Swarm Optimization (PSO), we build a novel data transmission cost model and use it to develop a new multi-datacenter cost-effective data placement strategy (also sketched below).

By simulating the hybrid cloud computing model, this paper builds a virtual hybrid cloud data center environment, runs the two proposed data placement strategies against other comparable strategies, and evaluates performance by comparing data transmission time and data transmission cost. The experimental results show that cross-datacenter transmission time and cost under our strategies are significantly lower than under the comparable strategies. The strategies are significant not only for the study of scientific workflow optimization in hybrid cloud computing, but also for other data-intensive applications in hybrid clouds. The outcome of this work can reduce operating costs for cloud service providers and provide cheap, safe and efficient computing and storage services for enterprises.
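To make the time-effective strategy more concrete, the following is a minimal sketch, not the thesis's actual algorithm: it assumes the "degree of data dependency damage" can be approximated by the total dependency weight that would be cut if two groups of datasets were separated across data centers, and it greedily merges the most tightly coupled groups until one group remains per data center. This corresponds only to the static, initial-stage placement; the dynamic, running-stage adjustment is omitted. The dependency matrix `dep` and the function names are illustrative assumptions.

```python
import itertools

def static_placement(dep, n_centers):
    """Greedy clustering of datasets by pairwise dependency weight (illustrative).

    dep       -- symmetric matrix, dep[i][j] = dependency weight between
                 datasets i and j (e.g. size of data exchanged by shared tasks)
    n_centers -- number of available data centers

    Returns a list mapping dataset index -> data-center index.
    """
    n = len(dep)
    # Start with every dataset in its own cluster.
    clusters = [{i} for i in range(n)]

    def cut_weight(a, b):
        # Dependency weight that would be "damaged" if clusters a and b
        # ended up in different data centers.
        return sum(dep[i][j] for i in a for j in b)

    # Merge the two clusters whose separation would damage the most
    # dependency, until only n_centers clusters remain.
    while len(clusters) > n_centers:
        ia, ib = max(itertools.combinations(range(len(clusters)), 2),
                     key=lambda p: cut_weight(clusters[p[0]], clusters[p[1]]))
        clusters[ia] |= clusters[ib]
        del clusters[ib]

    placement = [0] * n
    for center, cluster in enumerate(clusters):
        for ds in cluster:
            placement[ds] = center
    return placement

# Toy example: 4 datasets, 2 data centers; datasets 0-1 and 2-3 are tightly coupled.
dep = [[0, 9, 1, 0],
       [9, 0, 0, 1],
       [1, 0, 0, 8],
       [0, 1, 8, 0]]
print(static_placement(dep, 2))   # e.g. [0, 0, 1, 1]
```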
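Similarly, the following is a minimal sketch of a PSO-driven, cost-effective placement under assumed details, not the thesis's actual formulation: each particle encodes one candidate mapping of datasets to data centers, and the fitness is the total cross-datacenter transmission cost implied by the shared-dataset dependency matrix. The encoding, cost model and parameter values (swarm size, inertia `w`, coefficients `c1`/`c2`, `unit_cost`) are illustrative assumptions.

```python
import random

def transmission_cost(placement, dep, unit_cost):
    """Sum of dependency weights whose two datasets sit in different centers."""
    n = len(dep)
    return sum(dep[i][j] * unit_cost
               for i in range(n) for j in range(i + 1, n)
               if placement[i] != placement[j])

def pso_placement(dep, n_centers, unit_cost=1.0,
                  swarm=20, iters=200, w=0.7, c1=1.5, c2=1.5):
    n = len(dep)

    def decode(pos):
        # Round each continuous coordinate down to a valid data-center index.
        return [min(int(x), n_centers - 1) for x in pos]

    positions = [[random.uniform(0, n_centers) for _ in range(n)] for _ in range(swarm)]
    velocities = [[0.0] * n for _ in range(swarm)]
    pbest = [p[:] for p in positions]
    pbest_cost = [transmission_cost(decode(p), dep, unit_cost) for p in positions]
    g = min(range(swarm), key=lambda k: pbest_cost[k])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]

    for _ in range(iters):
        for k in range(swarm):
            for d in range(n):
                r1, r2 = random.random(), random.random()
                velocities[k][d] = (w * velocities[k][d]
                                    + c1 * r1 * (pbest[k][d] - positions[k][d])
                                    + c2 * r2 * (gbest[d] - positions[k][d]))
                # Keep positions inside the valid range of data-center indices.
                positions[k][d] = min(max(positions[k][d] + velocities[k][d], 0.0),
                                      n_centers - 1e-9)
            cost = transmission_cost(decode(positions[k]), dep, unit_cost)
            if cost < pbest_cost[k]:
                pbest[k], pbest_cost[k] = positions[k][:], cost
                if cost < gbest_cost:
                    gbest, gbest_cost = positions[k][:], cost
    return decode(gbest), gbest_cost

# Toy example: the same 4-dataset dependency matrix with two data centers.
dep = [[0, 9, 1, 0],
       [9, 0, 0, 1],
       [1, 0, 0, 8],
       [0, 1, 8, 0]]
placement, cost = pso_placement(dep, n_centers=2)
print(placement, cost)   # tightly coupled dataset pairs should share a center
```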
Keywords/Search Tags: Scientific Workflow, Hybrid Cloud Computing, Data Placement, Data Transmission Time, Data Transmission Costs, Clustering Algorithm, Particle Swarm Optimization Algorithm