Data Management and Data Processing Support on Array-Based Scientific Data

Posted on: 2016-04-09
Degree: Ph.D.
Type: Dissertation
University: The Ohio State University
Candidate: Wang, Yi
Full Text: PDF
GTID: 1478390017476375
Subject: Computer Science
Abstract/Summary:
Scientific simulations are now performed at finer temporal and spatial scales, leading to an explosion of output data (mostly in array-based formats) and to challenges in effectively storing, managing, querying, disseminating, analyzing, and visualizing these datasets. Many paradigms and tools used today for large-scale scientific data management and data processing are too heavy-weight and have inherent limitations, making it extremely hard to cope with the "big data" challenges in a variety of scientific domains.

Our overall goal is to provide high-performance data management and data processing support for array-based scientific data, targeting data-intensive applications and various forms of scientific array storage. We believe that such high-performance support can significantly reduce the prohibitively expensive costs of data translation, data transfer, data ingestion, data integration, data processing, and data storage involved in many scientific applications, leading to better performance, ease of use, and responsiveness.

On the one hand, we have investigated four data management topics. First, we built a light-weight data management layer over scientific datasets stored in HDF5, one of the popular array formats. Unlike many popular data transport protocols such as OPeNDAP, which require costly data translation and data transfer before remote data can be accessed, our implementation supports flexible server-side subsetting and aggregation with high parallel efficiency. Second, to avoid the high upfront data ingestion costs of loading large-scale array data into array databases like SciDB, we designed a system referred to as SAGA, which provides database-like support over native array storage. Specifically, we focused on implementing a number of structural aggregations (grid, sliding, hierarchical, and circular), which are unique to the array data model. Third, we proposed a novel approximate aggregation approach over array data using bitmap indexing. This approach operates on the compact bitmap indices rather than the original raw datasets, and supports fast, accurate, and flexible aggregations over any array or its subset without data reorganization. Fourth, we extended bitmap indexing to assist the data mining task of subgroup discovery over array data. Like the aggregation approach, our algorithm can operate entirely on bitmap indices, and it efficiently handles a key challenge associated with array data: a subgroup identified over array data can be described by value-based and/or dimension-based attributes.

On the other hand, we focused on both offline and in-situ data processing paradigms in the context of MapReduce. To process disk-resident scientific data in various data formats, we developed a customizable MapReduce-like framework, SciMATE, which can be adapted to support transparent processing on any of the scientific data formats. This avoids the unnecessary data integration and data reloading incurred by applying the traditional MapReduce paradigm to scientific data processing. We then designed another MapReduce-like framework, Smart, to support efficient in-situ scientific analytics in both time-sharing and space-sharing modes. In contrast to offline processing, our implementation can avoid, either completely or to a very large extent, both data transfer and data storage costs.
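
To make the idea of aggregating over bitmap indices rather than raw data more concrete, the following is a minimal sketch in Python. It is not the dissertation's algorithm: the function names `build_bitmap_index` and `approximate_sum`, the use of Python integers as bit vectors, and the choice to approximate each element by the midpoint of its value bin are all assumptions made for illustration only.

```python
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """Build one bitmap (a Python int used as a bit vector) per value bin.

    Bin i covers the half-open range [bin_edges[i], bin_edges[i + 1]).
    Bit p of a bitmap is set when the element at flat position p falls
    into that bin. (Hypothetical helper, not from the dissertation.)
    """
    bitmaps = [0] * (len(bin_edges) - 1)
    for pos, v in enumerate(values):
        b = bisect_right(bin_edges, v) - 1
        if not 0 <= b < len(bitmaps):
            raise ValueError(f"value {v} is outside the bin range")
        bitmaps[b] |= 1 << pos
    return bitmaps


def approximate_sum(bitmaps, bin_edges, mask=None):
    """Estimate SUM over the positions selected by `mask` using only the index.

    Assumption: every element is approximated by the midpoint of its bin.
    `mask` is a bit vector of selected positions; None selects everything.
    """
    total = 0.0
    for i, bitmap in enumerate(bitmaps):
        selected = bitmap if mask is None else bitmap & mask
        count = bin(selected).count("1")  # number of selected elements in bin i
        midpoint = (bin_edges[i] + bin_edges[i + 1]) / 2
        total += count * midpoint
    return total


# Illustrative data: a flattened array and four unit-width bins.
values = [0.5, 1.2, 2.7, 3.1, 1.9, 0.1]
bin_edges = [0, 1, 2, 3, 4]
index = build_bitmap_index(values, bin_edges)

# Aggregate over the whole array, then over the first three positions only.
full_estimate = approximate_sum(index, bin_edges)
subset_estimate = approximate_sum(index, bin_edges, mask=0b111)
```

In this sketch, subsetting reduces to a bitwise AND against the index, so the raw values are never reread or reorganized once the index is built; the accuracy of the estimate depends on how finely the value range is binned.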
Keywords/Search Tags: Scientific, Data processing, Data management, Data transfer, Array data, Data storage