
Foundations for Provenance-Aware Systems

Posted on: 2011-09-03
Degree: Ph.D.
Type: Dissertation
University: Harvard University
Candidate: Muniswamy-Reddy, Kiran-Kumar
Full Text: PDF
GTID: 1468390011971277
Subject: Computer Science
Abstract/Summary:
Digital provenance is metadata that describes the ancestry or history of a digital object. Provenance enhances the value of the data it describes because it answers questions such as: How was this object created? On what other objects does this object depend? How do the ancestries of these two objects differ?

In many digital systems, provenance is either not collected at all, which loses valuable information, or is recorded as an afterthought, which risks inconsistency between data and provenance. This dissertation demonstrates that storage systems are well suited to automatically inferring and managing provenance. Accordingly, we introduce the Provenance-Aware Storage System (PASS), a storage system that automatically collects and maintains the provenance of files. We describe the challenges in building a PASS and present an architecture for collecting provenance in local file systems. The provenance that PASS collects is useful for scientific documentation, debugging, security, search, and information lifecycle management. PASS imposes reasonable overheads, with a maximum observed elapsed-time overhead of 23%.

We then extend PASS to the more semantically rich domains of applications. Ultimately, we provide the disclosed provenance API, an interface that supports and encourages the integration of multiple provenance collection substrates, each operating at a particular abstraction layer. By integrating the provenance collected by PASS, a workflow engine, a web browser, and a runtime Python provenance-tracking wrapper, we demonstrate that this cross-layer integration provides powerful new functionality unavailable by other means.

While cross-layer provenance integration demonstrates how the PASS architecture can be extended up the application stack, we demonstrate the versatility of the architecture by extending it to network-attached stores (NAS) and cloud stores. To demonstrate the functionality of the architecture in a NAS, we augmented the NFS protocol with additional operations. Our augmented NFS protocol has reasonable overheads, with a maximum observed elapsed-time overhead of 16.8%. To demonstrate the functionality of the architecture in a cloud, we designed protocols that store provenance with data on the cloud. Our cloud protocol overheads are minimal, at less than 10% in most cases.
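
The three questions that open the abstract can be illustrated with a small Python sketch. It is not PASS code and does not reflect the disclosed provenance API: the dictionary layout, the file names, and the helpers `ancestors` and `ancestry_diff` are all hypothetical, and the sketch assumes provenance is simplified to a mapping from each object to the objects it was directly derived from.

```python
from collections import deque

# Hypothetical provenance records: each object maps to its direct inputs.
# Real provenance systems also track processes, versions, and more.
provenance = {
    "plot.png": {"plot.py", "results.csv"},
    "results.csv": {"analyze.py", "raw.dat"},
    "summary.txt": {"summarize.py", "results.csv"},
}


def ancestors(graph, obj):
    """Return every object that `obj` depends on, directly or transitively."""
    seen = set()
    queue = deque(graph.get(obj, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.add(node)
        queue.extend(graph.get(node, ()))
    return seen


def ancestry_diff(graph, a, b):
    """Return (ancestors only of a, ancestors only of b)."""
    anc_a, anc_b = ancestors(graph, a), ancestors(graph, b)
    return anc_a - anc_b, anc_b - anc_a


# "On what other objects does this object depend?"
plot_ancestry = ancestors(provenance, "plot.png")

# "How do the ancestries of these two objects differ?"
only_plot, only_summary = ancestry_diff(provenance, "plot.png", "summary.txt")
```

Here `plot_ancestry` holds the full dependency set of `plot.png`, and the two sets returned by `ancestry_diff` isolate the inputs that the two files do not share.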
Keywords/Search Tags: Provenance, PASS, Data, Systems, Overheads, Cloud