Research On Workflow Model For Multi-domain Scientific Data Management And Its Provenance Mechanism

Posted on:2020-08-29

Degree:Master

Type:Thesis

Country:China

Candidate:Q Sun

GTID:2428330599964890

Subject:Computer application technology

Abstract/Summary:

Growing capabilities for collecting and processing data across scientific fields have made it possible to keep extracting new value from scientific data. To manage and use ever-larger volumes of scientific big data, designing a sound management ecosystem or method that improves scientific big data management and analysis has become a research hotspot in many fields. Many organizations now develop their own domain-oriented scientific data management systems, and scientific workflows have become a mainstream tool with which scientists build and execute experiments. To trace the nature of scientific data and the origin of experimental results, scientists have also studied many provenance methods for validating, replicating, and reproducing scientific experiments.

However, because scientific data is heterogeneous and multi-source, scientists often have to acquire data from scientific data management systems and from a wide variety of databases and scientific devices in different fields, and they must spend considerable effort optimizing the experiments they build. Developers who design management systems must consider compatibility and coupling between modules, which requires a deep understanding of the scientific fields involved. Moreover, coarse-grained provenance for scientific workflows loses the internal details of workflow steps, which leads to incomplete or even incorrect inferences about data and call relationships and makes dependencies hard to distinguish. In this context, this thesis studies scientific workflow and provenance mechanisms for scientific data management, processing, and analysis. The specific research contents and innovations are as follows:

1) A multi-domain, role-divided architecture named Sci-SA (Science-Software Architecture) is proposed to manage multiple types of scientific data across fields. Sci-SA is divided into four functional areas according to function, and its interfaces are designed with REST, which reduces the coupling between modules. Sci-SA also integrates multiple types of databases and supports access to third-party systems, enabling the storage and sharing of heterogeneous, multi-source scientific data. Finally, to make the architecture easier to understand, Sci-SA's assets, components, interfaces, and other elements are described formally. On this basis, the corresponding roles are designed and defined, and Sci-SA is described in terms of the application context of each role in its functional area.

2) A DAG-based scientific workflow model named DP-SWF and its process optimization mechanism are proposed to construct and optimize scientific experiments so that multi-source scientific data is used effectively. The model uses directed acyclic graphs and identifiers to build scientific workflows that can be applied in multiple fields, and it handles the model's underlying layers transparently through a hierarchy, allowing scientists to focus on high-level scientific experiments. The process optimization mechanism starts from the association relationships between experimental tasks and applies fuzzy clustering to obtain a module partitioning scheme. Based on this scheme, the design structure is used to plan the execution order of the experimental units in each module. Finally, experiments on the correctness and effectiveness of DP-SWF are carried out on a dataset from the myExperiment database, and the results show that DP-SWF can optimize the experimental process while still meeting the needs of scientific experiments.

3) A content-rich, fine-grained scientific workflow provenance model called CF-PROV is proposed to solve the problem of coarse-grained workflow provenance. CF-PROV defines a representation method based on a provenance graph and a provenance document, and uses it as the conversion specification and declaration that maps scientific workflow information to provenance information. This reduces the programming overhead of capturing provenance and makes provenance information more standardized. To further enrich and refine the provenance information, the model divides scientific workflow provenance into data provenance and process provenance, and improves the readability of provenance information through the data dimension and field-level data deduction. Finally, experiments are carried out in four fields: astronomy, high energy physics, biology, and computer science. The results show that CF-PROV captures more detailed provenance information and that its storage and communication overhead is acceptable and manageable.
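
The abstract gives no implementation details for DP-SWF or CF-PROV, so the following is only a minimal Python sketch of the general idea behind them: a workflow expressed as a directed acyclic graph is executed in dependency order, and each step records both process provenance (which task ran and what it used) and data provenance (what each output was derived from). All names here (`run_workflow`, `lineage`, the task names, and the sample data) are hypothetical and are not taken from the thesis.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, deps, inputs):
    """Run tasks in dependency order and record provenance for each step.

    tasks:  task name -> function taking a list of upstream values
    deps:   task name -> list of upstream names (tasks or raw inputs)
    inputs: raw input name -> value
    """
    values = dict(inputs)
    process_prov = []  # ordered log: which task ran and which names it used
    data_prov = {}     # output name -> names it was directly derived from

    # TopologicalSorter takes node -> predecessors and raises CycleError
    # if the graph is not acyclic.
    for name in TopologicalSorter(deps).static_order():
        if name in values:  # raw input, nothing to execute
            continue
        used = deps.get(name, [])
        values[name] = tasks[name]([values[u] for u in used])
        process_prov.append({"task": name, "used": list(used)})
        data_prov[name] = list(used)
    return values, process_prov, data_prov


def lineage(data_prov, name):
    """Return every upstream name that `name` transitively depends on."""
    seen = set()
    stack = [name]
    while stack:
        for parent in data_prov.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical three-step experiment: clean the data, sum it, average it.
tasks = {
    "clean": lambda xs: [v for v in xs[0] if v is not None],
    "total": lambda xs: sum(xs[0]),
    "mean": lambda xs: xs[1] / len(xs[0]),
}
deps = {"clean": ["raw"], "total": ["clean"], "mean": ["clean", "total"]}
inputs = {"raw": [1, None, 2, 3]}

values, process_prov, data_prov = run_workflow(tasks, deps, inputs)
mean_lineage = lineage(data_prov, "mean")
```

Here `process_prov` answers "what ran, and in what order", while `lineage` walks `data_prov` to answer "which inputs and intermediate results does this value depend on". A fine-grained model such as the one the abstract describes would record considerably more detail per step than this sketch does.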

Keywords/Search Tags:

scientific big data management, software architecture, scientific workflow, provenance

Related items

1	Research On Scientific Workflow Reuse
2	Representing meaningful provenance in scientific workflow systems
3	Managing scientific workflow provenance
4	Querying and managing OPM-compliant scientific workflow provenance
5	The Design And Implementation Of Hunan Meteorological Scientific Management Information System
6	Design And Implementation Of A Provenance Framework In Workflow System-Nebulas
7	Enabling Reproducibility of Scientific Data Flows Through Tracking and Representation of Provenance
8	Research And Implementation Of Scientific Big Data Application Execution Optimization Mechanism In Multiple Data Center Environments
9	Research On Key Techniques Of Scientific Workflows In IaaS Environment
10	Scientific Workflow Modeling And Executing Technology Research