| Scientists can use scientific workflow management systems deployed in the cloud to carry out research activities such as cross-regional collaboration and complex simulation experiments. While executing workflow tasks, such a system generates a large amount of intermediate data, and these intermediate data sets have complex dependencies. How the intermediate data is managed therefore affects both the performance of the scientific workflow management system and the efficiency of scientific research. Because the cloud charges on a pay-as-you-go basis, realizing the trade-off between storing intermediate data sets and recomputing them requires using the dependencies among these data sets to find the optimal storage strategy and then computing the total cost of the scientific workflow management system. The main research problem of this paper is thus how to effectively manage the intermediate data generated by scientific workflows executed in the cloud so as to reduce the total cost of the scientific workflow system. This paper studies intermediate data management based on the data provenance graph. Firstly, a data regeneration strategy is proposed to address how the intermediate data of a scientific workflow can be regenerated from the data provenance graph. The data regeneration algorithm is then verified, and the results confirm its effectiveness. Secondly, an intermediate data storage optimization model based on the data provenance graph is established, and a genetic algorithm for finding the optimal storage strategy is proposed. The algorithm is verified experimentally on data provenance graphs of different complexity, and the results show that it can find the optimal storage strategy. Finally, the intermediate data cost calculation method is improved to reduce repeated computation of data, and experimental verification is carried out on the 
data provenance graphs of different complexity. The experimental results show that the improved data cost calculation method can effectively reduce the total execution cost of the scientific workflow system. This paper then studies intermediate data management based on the data flow graph. Firstly, a data regeneration strategy based on the data flow graph is proposed for regenerating the intermediate data of scientific workflows represented by data flow graphs. The data regeneration algorithm is verified experimentally on data flow graphs of different complexity, and the results demonstrate the effectiveness of the data regeneration strategy. Secondly, an optimization model for intermediate data storage based on the data flow graph is established and verified. Statistics on the performance of different data flow graphs under various evaluation methods show that the model and algorithm proposed in this paper can effectively find the optimal storage strategy. |
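
The storage-versus-recomputation trade-off described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the four-node provenance graph, the cost figures, the names `PARENTS`, `COSTS`, `regeneration_cost`, `total_cost`, and `best_strategy`, and the cost model itself (a stored data set pays its storage cost per period, while a deleted one pays its usage frequency times the cost of regenerating it from the nearest stored ancestors). The search is exhaustive enumeration over all storage strategies, used here in place of the genetic algorithm the paper proposes, which is only practical for very small graphs.

```python
from itertools import combinations

# Hypothetical provenance graph: data set -> direct predecessors.
PARENTS = {"d1": [], "d2": ["d1"], "d3": ["d1"], "d4": ["d2", "d3"]}

# Hypothetical figures per data set:
# (generation cost, storage cost per period, uses per period)
COSTS = {
    "d1": (10.0, 1.0, 0.1),
    "d2": (20.0, 2.0, 0.5),
    "d3": (5.0, 4.0, 0.2),
    "d4": (40.0, 3.0, 1.0),
}


def regeneration_cost(target, stored):
    """Cost to rebuild `target`: its own generation cost plus that of every
    unstored ancestor, each counted once even if reached by several paths."""
    needed, stack = set(), [target]
    while stack:
        d = stack.pop()
        if d in needed:
            continue
        needed.add(d)
        # Stop walking upwards at stored data sets: they can be read directly.
        stack.extend(p for p in PARENTS[d] if p not in stored)
    return sum(COSTS[d][0] for d in needed)


def total_cost(stored):
    """Cost per period of a strategy, given the set of stored data sets."""
    total = 0.0
    for d, (_, storage, uses) in COSTS.items():
        if d in stored:
            total += storage
        else:
            total += uses * regeneration_cost(d, stored)
    return total


def best_strategy():
    """Enumerate every subset of data sets and return the cheapest one."""
    names = sorted(COSTS)
    candidates = (
        frozenset(c)
        for r in range(len(names) + 1)
        for c in combinations(names, r)
    )
    return min(candidates, key=total_cost)


if __name__ == "__main__":
    strategy = best_strategy()
    print(sorted(strategy), total_cost(strategy))
```

Counting each unstored ancestor once in `regeneration_cost` mirrors, in a simplified way, the idea of avoiding repeated computation of shared data; how the paper's improved cost calculation actually does this is not specified in the abstract.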