
Specification, configuration and execution of data-intensive scientific applications

Posted on: 2011-06-16
Degree: Ph.D.
Type: Thesis
University: The Ohio State University
Candidate: Kumar, Vijay S
GTID: 2448390002960457
Subject: Engineering
Abstract/Summary:
Recent advances in digital sensor technology and numerical simulations of real-world phenomena are resulting in the acquisition of unprecedented amounts of raw digital data. Terms like 'data explosion' and 'data tsunami' have come to describe the uncontrolled rate at which scientific datasets are generated by automated sources ranging from digital microscopes and telescopes to in-silico models simulating the complex dynamics of physical and biological processes. Scientists in various domains now have secure, affordable access to petabyte-scale observational data gathered over time, the analysis of which is crucial to scientific discovery and to furthering knowledge within the domain. The availability of commodity components has fostered the development of large distributed systems with high-performance computing resources to support the execution requirements of scientific data analysis applications. Increased levels of middleware support over the years have aimed to provide high scalability of application execution on these systems. However, the high-resolution, multi-dimensional nature of scientific datasets and the complexity of analysis requirements present challenges to efficient application execution on such systems. Traditional brute-force analysis techniques for extracting useful information from scientific datasets may no longer meet desired performance levels at extreme data scales.

This thesis builds on a comprehensive study involving multi-dimensional data analysis applications at large data scales, and identifies a set of advanced factors or parameters for this class of applications that can be exploited in domain-specific ways to obtain substantial improvements in performance. Factors like the on-disk layout of datasets and the mechanisms for accessing them, and the mapping of analysis processes to computational resources, can be customized for performance based on our knowledge of an application's computational and I/O properties. A useful property of these applications is their ability to operate at multiple performance levels based on a set of trade-off parameters, while providing different levels of quality-of-service (QoS) specific to the application instance. To realize the performance benefits brought about by such factors, applications must be configured for execution in specific ways for specific systems. Middleware support for such domain-specific configuration is limited, and there is typically no integration across middleware layers to this end. Low-level manual configuration of applications within a large space of solutions is error-prone and tedious.

This thesis proposes an approach for the development and execution of large scientific multi-dimensional data analysis applications that takes multiple performance parameters into account and supports the notion of domain-specific configuration-as-a-service. My research identifies various aspects that go into the creation of a framework for user-guided, system-directed performance optimizations for such applications. The framework seeks to achieve this goal by integrating software modules that (i) provide a unified, homogeneous model for the high-level specification of any conceptual knowledge that may be used to configure applications within a domain, (ii) perform application configuration in response to user directives, i.e., use the specifications to translate high-level requirements into low-level execution plans optimized for a given system, and (iii) carry out the execution plans on the underlying system in an efficient and scalable manner. A prototype implementation of the framework that integrates several middleware layers is used for evaluating our approach. Experimental results gathered for real-world application scenarios from the domains of astronomy and biomedical imaging demonstrate the utility of our framework in meeting scientific performance requirements at very large data scales.
Keywords/Search Tags:Scientific, Data, Applications, Execution, Performance, Configuration, Large, Specific