Optimizing Access to Scientific Data for Storage, Analysis and Visualization

Posted on:2019-10-26

Degree:Ph.D

Type:Dissertation

University:University of California, Santa Cruz

Candidate:Ionkov, Latchesar

GTID:1448390002999767

Subject:Computer Science

Abstract/Summary:

Scientific workflows contain an increasing number of interacting applications, often with a large disparity between the formats of the data produced and consumed by different applications. This mismatch can degrade performance, because retrieving the data requires multiple read operations (often to a remote storage system) in order to convert it. In recent years, with the large increase in the amount of data and computational power available, there is demand for applications to support data access in situ, or close to the simulation, to provide application steering, analytics, and visualization.

Although some parallel filesystems and middleware libraries attempt to identify access patterns and optimize data retrieval, they frequently fail if the patterns are complex. It is evident that more knowledge of the structure of the datasets at the storage-system level would provide many opportunities for further performance improvements.

For most developers of scientific applications, storing the application data, and its particular format on disk, is not an essential part of the application. Although they acknowledge the importance of I/O performance, their expertise lies mostly in numerical simulations and the particular models their application simulates. Most of their effort is spent on ensuring that the application produces correct numerical results. Ideally, they would like a library call that reads a subset of the data from storage (no matter what its format is) and places it in the data structures the simulation defines in memory. Because the data needs to be analyzed and visualized, and has to be accessible from third-party tools, the scientists are forced to know more about the data formats.

In this dissertation we investigate multiple techniques for using dataset descriptions to improve performance and overall data availability for HPC applications. We introduce a declarative data description language that can be used to define the complete dataset as well as parts of it. These descriptions are used to generate transformation rules that allow data to be converted between different physical layouts on storage and in memory.

First, we define the DRepl dataset description language and use it to implement divergent data views and replicas as POSIX files. We evaluate the performance of this approach and demonstrate its advantages, both in transparent use by the application and in combined performance when the application is paired with analytics and/or visualization code that reads the data in a different format. DRepl decouples the data producers and consumers, and the data layouts they use, from the way the data is stored on the storage system. DRepl has shown up to a 2x improvement in cumulative performance when data is accessed using optimized replicas.

Second, we extend the previous approach to the parallel environment used in HPC. Instead of using POSIX files, the new method allows data to be accessed in larger chunks (fragments) in the way it will be laid out in memory. Developers define the data structures they have in the process's memory and the overall format of the dataset on storage, and the runtime automatically transforms the data between the two. Both the in-memory and on-disk formats are described with the DRepl language. Replacing the ability to read the data as an array of bytes with operations that use descriptions of the data structure gives the storage system better opportunities to optimize access to the persistent data. The integration of this technique in Ceph demonstrates the potential advantages of this approach. The experiments show performance improvements of up to 5 times for writes and 10 times for reads, compared to collective MPI I/O.

Third, we explore future directions for extending the DRepl language to support more complex datasets. The additions would allow scientists to use different resolutions for different parts of a multi-dimensional space, and to define how to transform the data between resolutions. The changes would also allow completely abstract definitions of datasets, not only for continuums but also for primitive types such as real and integer numbers. The fragments of the dataset that are present in memory or on disk will have concrete types that are compatible with the abstract types used in the dataset.

Finally, we provide foundations for extending the previous functionality to the most complicated data structures used in scientific applications: unstructured meshes.
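
The central idea in the abstract, deriving a transformation from two descriptions of the same dataset, can be illustrated with a small sketch. This is not DRepl syntax or its API: the layout names `aos` and `soa`, the functions `offset`, `transform`, and `read_field`, and the fixed 8-byte field size are all assumptions made for illustration. A producer writes interleaved records, and a consumer that wants one field contiguous reads from a replica in a different layout.

```python
import struct

# Hypothetical layout descriptions (NOT DRepl syntax). A dataset is a list of
# records with named fields; every field is an 8-byte little-endian double.
#   "aos": records interleaved      x0 y0 z0 x1 y1 z1 ...
#   "soa": stored field by field    x0 x1 ... y0 y1 ... z0 z1 ...
FIELD_SIZE = 8


def offset(layout, fields, count, record, field):
    """Byte offset of one field of one record under the given layout."""
    f = fields.index(field)
    if layout == "aos":
        return (record * len(fields) + f) * FIELD_SIZE
    if layout == "soa":
        return (f * count + record) * FIELD_SIZE
    raise ValueError(f"unknown layout: {layout}")


def transform(data, fields, count, src, dst):
    """Copy every field of every record from layout src to layout dst."""
    out = bytearray(len(data))
    for record in range(count):
        for field in fields:
            s = offset(src, fields, count, record, field)
            d = offset(dst, fields, count, record, field)
            out[d:d + FIELD_SIZE] = data[s:s + FIELD_SIZE]
    return bytes(out)


def read_field(data, fields, count, layout, field):
    """Read one field of all records from a buffer stored in `layout`."""
    return [
        struct.unpack_from("<d", data, offset(layout, fields, count, r, field))[0]
        for r in range(count)
    ]


fields = ["x", "y", "z"]
count = 4

# The producer writes interleaved records: (0, 10, 20), (1, 11, 21), ...
aos = b"".join(struct.pack("<3d", r, 10 + r, 20 + r) for r in range(count))

# A replica in the consumer's preferred layout, derived from the descriptions.
soa = transform(aos, fields, count, "aos", "soa")

y_values = read_field(soa, fields, count, "soa", "y")
```

In the `soa` replica all `y` values occupy one contiguous byte range, so a consumer that needs only that field can fetch it with a single sequential read instead of one strided read per record. That is the kind of access the abstract attributes to optimized replicas.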

Keywords/Search Tags:

Data, Scientific, Applications, Storage, Access, Used, Performance, Format

Related items

1	Comparing HDF5 and NetCDF for scientific data conversion and storage
2	Research On Scientific Literature And Scientific Data Storage Retrieval Based On Elastic Search
3	Research On Converged Data Management Techniques For High-performance Computing Systems
4	Specification, configuration and execution of data-intensive scientific applications
5	Conversion Technology Of Three-dimensional Models' Representation And The Custom Storage Format
6	Improved Algorithm And Performance Optimization Of Distributed Storage System Based On NoSQL
7	Evaluating And Optimizing Scientific Applications On Many-Core Platforms
8	Data management, storage and access optimizations in high performance distributed environment
9	Research On Storage Quality Of Service For Cloud Data Centers
10	End-to-end Noncontiguous Access Pattern Optimization for Extreme-scale Scientific Data Analytics