Provenance-aware framework for lightweight capture and high quality data regeneration

Posted on: 2015-05-22
Degree: Ph.D.
Type: Dissertation
University: Indiana University
Candidate: Ghoshal, Devarshi
Full Text: PDF
GTID: 1478390020452789
Subject: Computer Science
Abstract/Summary:
The provenance, or derivation history, of a dataset is additional data needed to assess data quality and to repeat the science behind the results, which in turn enables data regeneration. Existing models and techniques for provenance capture identify and record provenance only during the execution of an experiment. These mechanisms are coarse-grained: they are frequently unable to capture the exact mapping between the inputs and outputs of an experiment, and they record only the processes that generate the data and the inputs to those processes. However, the underlying programs change, and the applications that create an experiment are frequently regenerated with different configurations or parameters. Additionally, provenance capture across different programming environments and execution platforms has so far required human intervention in the form of manually annotated applications, which can result in incorrect or missing information. Currently, no generic model or framework exists that automatically identifies and captures provenance while addressing these issues, as is required for regenerating data from a wide range of applications running in different environments.
This dissertation proposes a unified model of provenance capture for high-quality data regeneration and builds a framework for automatically capturing provenance with extremely low overhead. The generic framework captures provenance for both the application and the data generated by the application. A combination of compilers, a runtime system, and a rule engine is used to collect provenance at different levels of application deployment and execution. This dissertation highlights the generality of this framework and studies the viability of capturing provenance through static (compile-time) and dynamic (runtime) components.
We evaluate our methodology on different types of applications and benchmarks and show that capturing static and dynamic provenance contributes to high-quality data analysis and regeneration. Finally, we study and recommend provenance capture approaches that take application and platform limitations into consideration.
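To make the distinction between static and dynamic provenance concrete, the following is a minimal illustrative sketch in Python. It is not the dissertation's framework, which the abstract describes as built from compilers, a runtime system, and a rule engine. Every name here (`capture_provenance`, `PROVENANCE_LOG`, `scale`) is a hypothetical chosen for illustration. The sketch assumes that a fingerprint of the program text can stand in for static provenance and that hashes of each call's inputs and output can stand in for dynamic provenance.

```python
import hashlib
import inspect
import json
from functools import wraps

# Hypothetical in-memory provenance store (an assumption for this sketch).
PROVENANCE_LOG = []


def _digest(value):
    """Return a stable SHA-256 digest of a JSON-serializable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def capture_provenance(func):
    """Record static provenance once and dynamic provenance on every call."""
    # Static provenance: a fingerprint of the program text, taken before
    # the function ever runs. Fall back to the function name when the
    # source text is unavailable (for example, code defined interactively).
    try:
        source = inspect.getsource(func)
    except (OSError, TypeError):
        source = func.__name__
    static_record = {
        "function": func.__name__,
        "source_sha256": hashlib.sha256(source.encode("utf-8")).hexdigest(),
    }

    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        # Dynamic provenance: the exact input-to-output mapping of this run.
        PROVENANCE_LOG.append({
            "static": static_record,
            "dynamic": {
                "inputs_sha256": _digest({"args": list(args), "kwargs": kwargs}),
                "output_sha256": _digest(result),
            },
        })
        return result

    return wrapper


@capture_provenance
def scale(values, factor=2):
    """Toy 'experiment' step: multiply every value by a factor."""
    return [v * factor for v in values]
```

Calling `scale([1, 2, 3])` appends one record to `PROVENANCE_LOG`. Rerunning with a different `factor` keeps the static fingerprint but changes the input hash, which is the kind of per-run mapping that execution-only, coarse-grained capture tends to miss.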
Keywords/Search Tags: Provenance, Data, Quality, Capture, Regeneration, Framework, Application