
Research On Data Placement Strategy For Data-Sharing Scientific Cloud Workflows

Posted on: 2017-05-11    Degree: Master    Type: Thesis
Country: China    Candidate: Y Wu    Full Text: PDF
GTID: 2308330485464001    Subject: Computer software and theory
Abstract/Summary:
Scientific workflows are data-driven, data-intensive and compute-intensive workflows that automate the processing of users' requests. Because they can manage, transfer, analyze, simulate and visualize the processes of scientific computation, workflows help researchers conduct scientific research, and there are numerous successful applications built on workflow technologies in areas such as High-Energy Physics, Weather Diagnostics and Bioinformatics.

With the rapid development of information technology, cloud computing has become a next-generation computing platform. Cloud service providers offer customers massive, inexpensive and dynamic computing, storage and network resources over the Internet in a pay-as-you-go manner. As a high-performance, scalable and elastic computing model, a cloud platform can dramatically reduce the cost and resource consumption of building workflow systems for research institutions, enterprises and governments, and thus provides an ideal environment for deploying data-intensive scientific workflows. In fact, the analysis and application of scientific cloud workflow management systems has become a hot research topic in both academia and industry.

Although scientific cloud workflow management systems have inherent advantages, they still face new challenges in the Big Data era. A scientific cloud workflow is usually run cooperatively by several geographically distributed research institutes, universities, enterprises and government departments, so data sharing and interaction occur both within individual workflows and among multiple workflows. Allocating datasets to appropriate datacenters is therefore critical, given the privacy constraints, data sharing and staging, multi-user and multi-task execution across cloud datacenters, and the dynamics and uncertainty of cloud services in the Big Data era. A series of studies has addressed how to optimize data placement among cloud datacenters, but most existing placement solutions rely only on the dependencies between datasets and tasks; they neither study comprehensively how dataset attributes affect placement nor investigate how datasets shared among workflows influence placement decisions.

Building on existing data placement research, this paper summarizes traditional workflow data placement methods as task-level data placement methods; studies data sharing among workflows by refining the workflow model and framework and defining dataset types; and adopts a two-phase data placement method based on the Particle Swarm Optimization (PSO) algorithm to optimize placement solutions and reduce the data transfer cost of executing workflow applications. The main work and contributions of this paper are as follows:

1. Focusing on the datasets, tasks and storage resources of workflows, this paper analyzes the dependencies between datasets, between datasets and tasks, and between tasks, as well as the allocation of datasets among cloud datacenters. Scientific workflows are data-intensive applications whose tasks consume massive volumes of related datasets, so the relationships between datasets and tasks are many-to-many. Moreover, workflow datasets are of different types, for example initial datasets (original datasets), generated datasets (intermediate datasets), privacy datasets, shared datasets and fixed datasets. Because of the flexibility of workflow execution, the relationships between datasets and tasks are complicated, and for safety reasons workflow datasets and tasks are usually distributed over several cloud datacenters. Accordingly, we give a detailed analysis and classification of dataset types and clarify the dependencies between datasets and tasks, which facilitates the integration and optimization of task-level data placement frameworks and models. A simplified representation of these relationships is sketched below.
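As a minimal illustration of the dataset types and the many-to-many dataset-task dependencies described in point 1, the following Python sketch shows one possible representation; all class and field names (Dataset, Task, DatasetKind, and so on) are assumptions made here for illustration and are not taken from the thesis.

```python
# Illustrative sketch only: names and structure are assumed, not the thesis's model.
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional, Set


class DatasetKind(Enum):
    INITIAL = auto()    # original input data, exists before the workflow runs
    GENERATED = auto()  # intermediate data produced by tasks at runtime
    PRIVACY = auto()    # must stay in a fixed, trusted datacenter
    SHARED = auto()     # consumed by tasks of more than one workflow


@dataclass
class Dataset:
    name: str
    size_gb: float
    kind: DatasetKind
    fixed_datacenter: Optional[int] = None  # set for privacy/fixed datasets


@dataclass
class Task:
    name: str
    workflow: str
    inputs: Set[str] = field(default_factory=set)   # names of datasets consumed
    outputs: Set[str] = field(default_factory=set)  # names of datasets produced


# Many-to-many dependency: one dataset may feed several tasks (possibly in
# different workflows), and one task may read several datasets.
d_shared = Dataset("d_shared", 120.0, DatasetKind.SHARED)
t1 = Task("t1", workflow="wf_A", inputs={"d_shared", "d1"}, outputs={"d2"})
t2 = Task("t2", workflow="wf_B", inputs={"d_shared"}, outputs={"d3"})
```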
2. Based on existing data placement research, we summarize the traditional task-level data placement model and present its relevant definitions and data transfer cost model. Some existing cloud workflow data placement models are rather abstract and are mainly used to study the number of data transfers, the transferred volume or the transfer time during workflow execution. Although such benchmarks can indicate the performance of a placement method, what users primarily care about is the cost of deploying their workflow systems on a cloud platform. In this paper we systematically construct a cloud workflow model and analyze the data sharing that occurs inside workflows according to the different dataset types. From the customer's point of view, a task-level data transfer cost model is then constructed to measure the data transfer cost of cloud workflows and to evaluate the performance of traditional data placement methods.

3. To address the limitations of the task-level model, this paper proposes a workflow-level data placement framework and data transfer cost model to optimize placement schemes and reduce data transfer cost. Scientific cloud workflows are multi-user collaborative processes: during execution, their tasks require the cooperation of several geographically distributed research institutions, universities, enterprises or even governments, so data sharing exists not only within individual workflows but also among multiple workflows. For example, several datasets may be used in different research areas, and the workflows of those areas interact to produce, compute and partition common shared datasets. The traditional task-level model places each workflow's data in isolation, neglects the data shared among multiple workflows and limits placement flexibility, so the total data transfer cost is high. Therefore, this paper studies the structure of data-sharing workflows based on their shared datasets and merges the interacting workflows into a single workflow at the workflow level; a workflow-level data transfer cost model is then designed to optimize placement schemes and reduce data transfer cost. A rough sketch of the kind of transfer cost evaluated at both levels is given below.
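To make the cost models in points 2 and 3 concrete, the sketch below shows one simple way such a data transfer cost could be computed: whenever a task runs in a datacenter that does not already hold one of its input datasets, the dataset's size is charged at that datacenter pair's transfer price. The function and parameter names are hypothetical, and the thesis's actual cost model may contain further terms.

```python
# Hypothetical transfer-cost function; a simplified stand-in for the thesis's model.
from typing import Dict, Set, Tuple


def transfer_cost(task_inputs: Dict[str, Set[str]],     # task -> input dataset names
                  dataset_size_gb: Dict[str, float],    # dataset -> size in GB
                  dataset_dc: Dict[str, int],           # placement: dataset -> datacenter
                  task_dc: Dict[str, int],              # scheduling: task -> datacenter
                  price_per_gb: Dict[Tuple[int, int], float]) -> float:
    """Cost of moving every input dataset to the datacenter that runs its task."""
    total = 0.0
    for task, inputs in task_inputs.items():
        for ds in inputs:
            src, dst = dataset_dc[ds], task_dc[task]
            if src != dst:  # co-located data incurs no transfer cost
                total += dataset_size_gb[ds] * price_per_gb[(src, dst)]
    return total
```

Under a cost of this kind, placing each workflow's datasets in isolation can leave a shared dataset far from the tasks of the other workflows that also read it, whereas merging the interacting workflows lets the optimizer place it once, close to all of its consumers.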
4. Based on the life cycle of cloud workflows and the attributes of the different dataset types, we design a two-phase workflow data placement method that optimizes the dataset-to-datacenter mapping according to each dataset's phase and type by invoking a Discrete Particle Swarm Optimization (DPSO) algorithm. Cloud workflow data placement is an NP-hard problem, and existing placement methods fall into two categories, clustering methods and intelligent methods, whose main concerns are the number of data transfers, the transfer time and the transfer cost during workflow execution. Clustering methods usually partition datasets into sub-blocks based on load balancing among datacenters and the dependencies among datasets, and then distribute the sub-blocks to suitable datacenters. In real cloud environments, however, datacenters offer nearly unlimited storage, so developers and users rarely need to worry about overloading them. In contrast, intelligent data placement methods can produce solutions tailored to the specification of the cloud workflows and the users' demands. Among intelligent methods, the PSO algorithm, with its few parameters, low computational cost, fast convergence and high efficiency, is widely used for function optimization, task scheduling and data placement problems. Our placement method consists of a build-time stage and a runtime stage, which place the initial datasets and the generated datasets, respectively. Both stages invoke the DPSO-DPA algorithm to obtain placement maps for the flexible datasets (flexible initial datasets and generated datasets, respectively), and the complete data placement solution is assembled from the two stages' placement maps; an illustrative search loop in this spirit is sketched below.

To summarize, this paper reviews traditional task-level data placement methods, analyzes the data-sharing phenomenon, proposes a data-sharing cloud workflow framework and data placement model, designs a workflow-level data transfer cost model, and applies a two-stage data placement approach that calls the DPSO-DPA algorithm to obtain the final placement solutions. Experimental results show that, compared with existing representative data placement methods, our method is the most robust, achieves the best performance and produces the most cost-effective placement solutions in terms of data transfer cost.
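The following sketch illustrates the two-phase, discrete-PSO style of search described in point 4. It uses a common discretization of PSO in which each particle is a vector assigning one datacenter to each flexible dataset, and each dimension moves toward the personal or global best with some probability; it is not the thesis's DPSO-DPA algorithm, whose exact operators are not given in this abstract, and all names and parameter values are assumptions.

```python
# Illustrative discrete-PSO placement search; not the thesis's DPSO-DPA algorithm.
import random
from typing import Callable, List


def dpso_place(num_datasets: int, num_dcs: int,
               cost: Callable[[List[int]], float],
               particles: int = 30, iters: int = 200,
               w: float = 0.4, c1: float = 0.3, c2: float = 0.3) -> List[int]:
    """Search for a low-cost mapping of flexible datasets to datacenters."""
    swarm = [[random.randrange(num_dcs) for _ in range(num_datasets)]
             for _ in range(particles)]
    pbest = [p[:] for p in swarm]        # each particle's best placement so far
    gbest = min(pbest, key=cost)[:]      # best placement found by the whole swarm
    for _ in range(iters):
        for i, p in enumerate(swarm):
            for d in range(num_datasets):
                r = random.random()
                if r < w:                # random exploration (inertia-like term)
                    p[d] = random.randrange(num_dcs)
                elif r < w + c1:         # move this dimension toward the personal best
                    p[d] = pbest[i][d]
                else:                    # move this dimension toward the global best
                    p[d] = gbest[d]
            if cost(p) < cost(pbest[i]):
                pbest[i] = p[:]
                if cost(pbest[i]) < cost(gbest):
                    gbest = pbest[i][:]
    return gbest

# Build-time phase: place the flexible initial datasets (fixed/privacy datasets keep
# their datacenters).  Runtime phase: as generated datasets appear, place them with
# the same routine while treating earlier decisions as constants, then assemble the
# two phases' maps into the complete placement solution.
```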
Keywords/Search Tags: Cloud Computing, Scientific Workflow, Data-Sharing, Data Placement, Particle Swarm Optimization