Large scale data management for the sciences

Posted on: 2009-07-23
Degree: Ph.D
Type: Thesis
University: The Johns Hopkins University
Candidate: Malik, Tanu
Full Text: PDF
GTID: 2448390005955798
Subject: Computer Science
Abstract/Summary:
Traditional enterprises and novel scientific applications are accumulating petabyte-scale datasets, which makes the need for large-scale data management more pressing than ever. The geographic distribution of these datasets, combined with complex demands on the data, makes large-scale data management challenging. This is especially true for sciences that model complex physical and biological phenomena using data from multiple sources.
This dissertation addresses two critical problems in the management of scientific datasets: combining a large number of diverse data sources to execute scientific queries, and executing data-intensive scientific queries efficiently in terms of both network and I/O usage. As a first step towards scientific data management, this thesis describes the design and specification of SkyQuery, a system that seamlessly federates data from several petabyte-scale, autonomous, and heterogeneous astronomy databases scattered worldwide. Using SkyQuery, scientists can write declarative queries that compare and merge multiple astronomical datasets. For efficient query execution and scalability, we propose Bypass-Yield Caching, a novel caching framework for database systems that dramatically reduces the network bandwidth requirements of data-intensive federations such as SkyQuery, making them good network citizens. Our description of the bypass-yield cache includes novel cache evaluation metrics and several innovative algorithms. Distributed applications such as the bypass-yield cache often rely on a priori knowledge of query cardinalities to make cache optimization decisions. In this context, we present a black-box approach to cardinality estimation that is suitable for distributed applications.
All our techniques are general in that they can be adapted to other scientific domains, such as the life and earth sciences, where similar data management problems abound. The success of SkyQuery and its adoption by the National Virtual Observatory (NVO) is an example of data management systems enabling scientific endeavors.
Keywords/Search Tags: Data management, Scientific, Sciences, Datasets