
Compilation, locality optimization, and managed distributed execution of scientific dataflows

Posted on: 2009-06-17
Degree: Ph.D
Type: Dissertation
University: University of California, Irvine
Candidate: Wang, Daniel Liwei
Full Text: PDF
GTID: 1448390005951805
Subject: Engineering
Abstract/Summary:
Supercomputing and other high-performance computing technologies have succeeded in achieving high computational throughput in geoscience atmospheric, land, and ocean modeling, but have largely ignored the problem of processing and analyzing the resulting model predictions at a similar scale. Reductive data analysis is severely limited by the financial and temporal costs of large-scale data transfer. Scientific workflow frameworks enable scientists to leverage grid-scale resources, but remain too complex for individual scientists to use, despite the availability of graphical tools.

To address the quickly growing volume of data and the growing desire to share and use each other's data, this research makes three major contributions. First, shell compilation is introduced as a feasible method for optimizing, sandboxing, and porting shell scripts, which are programs of programs. Shell compilation allows scientists to reuse their existing analysis scripts and exploit parallel and distributed computing technology with minimal, if any, porting effort. The application of standard compilation techniques at this higher level is described, noting the new semantic differences and potential benefits (such as automatic program-level parallelism) that arise. Second, the ability to compile scripts is applied in geoscience to automatically convert scripts into scientific workflows, making it possible to transparently distribute computation to remote data servers and to reduce or eliminate unnecessary data download. The resulting system, the Script Workflow Analysis for MultiProcessing (SWAMP) system, dynamically schedules and executes workflows, dispatching commands among cluster machines while paying particular attention to data locality and minimizing internal data transfer, a feature particularly important for data-intensive workloads. Performance is shown to be effective on real geoscience data-reduction analysis scripts.
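The script-to-workflow conversion can be illustrated with a minimal sketch. This is not the SWAMP implementation; the input/output inference rule is a hypothetical simplification (each command's last file argument is taken as its output, the rest as inputs, in the style of common netCDF Operator pipelines). Each command becomes a node, and an edge is drawn wherever one command reads a file another writes, so commands with no connecting edge can run in parallel.

```python
# Sketch: infer a dataflow DAG from a linear shell script.
# Hypothetical assumption: each command's last argument is its output
# file and the preceding arguments are its input files.

def parse(script):
    """Return a list of (command, input_set, output) triples."""
    tasks = []
    for line in script.strip().splitlines():
        cmd, *files = line.split()
        *inputs, output = files
        tasks.append((cmd, set(inputs), output))
    return tasks

def dependencies(tasks):
    """Edge (i, j) means task j reads a file that task i writes."""
    edges = set()
    for i, (_, _, out_i) in enumerate(tasks):
        for j, (_, in_j, _) in enumerate(tasks):
            if i != j and out_i in in_j:
                edges.add((i, j))
    return edges

script = """
ncra jan.nc feb.nc mar.nc q1.nc
ncra apr.nc may.nc jun.nc q2.nc
ncrcat q1.nc q2.nc half.nc
"""
tasks = parse(script)
print(dependencies(tasks))  # tasks 0 and 1 are independent; both feed task 2
```

In this example the two averaging commands share no files and could be dispatched to different cluster machines, while the concatenation must wait for both, which is the program-level parallelism the abstract describes.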
Third, the characteristics of I/O-constrained workloads are analyzed and described, along with a technique for explicitly caching files in memory and a new partitioning algorithm, Independent Set Partitioning (InSeP), whose simple, high-level approach based on set operations can be applied to dynamically scheduled workflows.
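One way to read the set-based idea behind a partitioning scheme like InSeP can be sketched as follows. This is an illustrative sketch, not the published algorithm: it repeatedly peels off the set of tasks whose inputs are not produced by any still-unscheduled task, so each resulting partition has no internal ordering constraints and can run fully in parallel.

```python
# Illustrative sketch of set-based workflow partitioning (not the
# published InSeP algorithm): repeatedly collect the tasks whose input
# files are not outputs of any remaining task.

def independent_set_partition(tasks):
    """tasks: dict name -> (input_file_set, output_file)."""
    remaining = dict(tasks)
    partitions = []
    while remaining:
        pending_outputs = {out for (_, out) in remaining.values()}
        # A task is ready when none of its inputs are still pending.
        ready = {name for name, (ins, _) in remaining.items()
                 if not (ins & pending_outputs)}
        if not ready:
            raise ValueError("cycle detected in workflow")
        partitions.append(ready)
        for name in ready:
            del remaining[name]
    return partitions

tasks = {
    "avg_q1": ({"jan.nc", "feb.nc"}, "q1.nc"),
    "avg_q2": ({"mar.nc", "apr.nc"}, "q2.nc"),
    "concat": ({"q1.nc", "q2.nc"}, "half.nc"),
}
print(independent_set_partition(tasks))
```

Because each step uses only set intersection and difference, the same test can be re-run as tasks arrive, which is why such an approach suits dynamically scheduled workflows.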
Keywords/Search Tags:Data, Compilation, Scientific