Multi-versioned Data Storage and Iterative Processing in a Parallel Array Database Engine

Posted on:2015-07-18

Degree:Ph.D

Type:Thesis

University:University of Washington

Candidate:Soroush, Emad

GTID:2478390020452217

Subject:Computer Science

Abstract/Summary:

Scientists today can generate data at an unprecedented scale and rate. For example, the Sloan Digital Sky Survey (SDSS) generates 200 GB of data containing millions of objects each night during routine operation. The Large Hadron Collider produces even more, approximately 30 PB annually. The Large Synoptic Survey Telescope (LSST) will also produce approximately 30 TB of data per night in a few years. In many fields of science, multidimensional arrays rather than flat tables are the standard data type, because data values are associated with coordinates in space and time. For example, images in astronomy are 2D arrays of pixel intensities. Climate and ocean models use arrays or meshes to describe 3D regions of the atmosphere and oceans. As a result, scientists need powerful tools to help them manage massive arrays.

This thesis focuses on several challenges in building parallel array data management systems that facilitate massive-scale data analytics over arrays.

The first challenge in building an array data processing system is simply how to store arrays on disk. The key question is how to partition arrays into smaller fragments, called chunks, that form the unit of I/O, processing, and data distribution across machines in a cluster. We explore this question in ArrayStore, a new read-only storage manager for parallel array processing. In ArrayStore, we study the impact of different chunking strategies on query processing performance for a wide range of operations, including binary operators and user-defined functions. ArrayStore also proposes two new techniques that enable operators to access data from adjacent array fragments during parallel processing.

The second challenge that we explore in building array systems is the ability to create, archive, and explore different versions of the array data. We address this challenge in TimeArr, a new append-only storage manager for an array database. Its key contribution is to efficiently store and retrieve versions of an entire array or of a sub-array. To achieve high performance, TimeArr relies on several techniques, including virtual tiles, bitmask compression of changes, variable-length delta representations, and skip links.

The third challenge that we tackle in building parallel array engines is how to provide efficient iterative computation on multidimensional scientific arrays. We present the design, implementation, and evaluation of ArrayLoop, an extension of SciDB with native support for array iterations. In the context of ArrayLoop, we develop a model for iterative processing in a parallel array engine. We then present three optimizations that improve the performance of these computations: incremental processing, mini-iteration overlap processing, and multi-resolution processing.

Finally, as motivation for our work, and to help put our technology into the hands of science users, we have built the AscotDB system. AscotDB is a new, extensible data analysis system for the interactive analysis of data from astronomical surveys. AscotDB provides a compelling and powerful environment for the exploration, analysis, visualization, and sharing of large array datasets.
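
The chunking question raised in the ArrayStore part of the abstract can be illustrated with a minimal sketch. It assumes the simplest possible strategy, regular fixed-size chunks, and is not the ArrayStore implementation; the function names `chunk_of` and `chunks_for_range` and the 100x100 chunk shape are hypothetical, chosen only for illustration. It shows how a cell coordinate maps to a chunk and which chunks a rectangular range query would have to read.

```python
from itertools import product


def chunk_of(cell, chunk_shape):
    """Return the coordinate of the regular chunk that holds `cell`."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))


def chunks_for_range(lo, hi, chunk_shape):
    """List the chunk coordinates overlapping the inclusive cell range [lo, hi]."""
    first = chunk_of(lo, chunk_shape)
    last = chunk_of(hi, chunk_shape)
    # One range of chunk indices per dimension; their cross product is the answer.
    axes = [range(f, l + 1) for f, l in zip(first, last)]
    return list(product(*axes))


# Assumed example: a 2D image stored in 100x100-cell chunks (illustrative size).
CHUNK_SHAPE = (100, 100)
touched = chunks_for_range((50, 250), (120, 310), CHUNK_SHAPE)
print(touched)  # chunk coordinates a range query over these cells must read
```

Under this assumed scheme, a larger chunk shape means fewer chunks per query but more unneeded cells read from each one, which is one form of the trade-off the abstract says ArrayStore studies across chunking strategies.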

Keywords/Search Tags:

Data, Array, Processing, Iterative, Storage

Related items

1	Data Management and Data Processing Support on Array-Based Scientific Data
2	The CACHE Design In Network Storage Array
3	Research And Application Of Iterative Signal Processing Techniques
4	Research On High-speed Data Storage Technology Based On Flash Memory Array
5	Research On Local Redundant Array Code In Distributed Storage System
6	The Research Of Data Storage Technology Based On NAND FLASH Array
7	Research On Hadoop Based Iterative Data Processing And Data Placement Strategy
8	Iterative detection for page-oriented optical data storage systems
9	The Research And Design Of High Speed Solid-state Storage Based On EMMC Array
10	Research On Secondary Storage Structure In Continuous System