An adaptable repository for complex scientific metadata

Posted on: 2011-03-05
Degree: Ph.D.
Type: Thesis
University: Indiana University
Candidate: Jensen, Scott
Full Text: PDF
GTID: 2448390002958210
Subject: Computer Science
Abstract/Summary:
The explosive growth in computational science has led a broad spectrum of scientific communities to realize the need to capture and preserve the deluge of data being generated and the metadata that describes it. Metadata is recognized as essential, which has led to detailed metadata specifications defined as XML schemata. However, although the volume of data keeps growing, the metadata that describes it has not kept pace. In part this is due to the incentive misalignment between when metadata is generated and when it has value. Metadata is ephemeral and must be captured as an experiment runs, but the value of the data (and of the metadata used to describe or use it) is often unknown, possibly for decades.
Existing middleware for cataloging metadata across scientific domains necessarily takes a generic approach and cannot communicate using domain-specific schemata without customized middleware. This dissertation presents a different approach based on the thesis that although scientific metadata schemata are domain-specific, they share commonalities that differentiate them from other schemata, metadata in non-scientific domains, or general XML. The key characteristics of scientific metadata schemata that we identify are their composition from unordered, independent concepts and the need to capture metadata incrementally, concept by concept. Additionally, unlike data communicated as XML, scientific discovery metadata serves as a search index for locating relevant data sets. Based on these commonalities in both the structure and the use of scientific metadata, we show that scientific metadata schemata can be partitioned into sets of unordered metadata concepts. This partitioning enables a global ordering of concepts, which we exploit in a generalized framework that is a hybrid of approaches used to store XML. The hybrid approach enables both detailed search queries over the metadata and efficient reconstruction of XML in response to queries.
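As a rough illustration of the hybrid idea described above, the following Python sketch stores each metadata concept twice: once as an intact XML fragment (for fast reconstruction) and once shredded into simple index rows (for search). It is not the XMC Cat implementation. The class name `ConceptStore`, the concept names in `CONCEPT_ORDER`, the `metadata` root element, and the in-memory data structures are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical global ordering of concepts, as might be derived from a domain schema.
CONCEPT_ORDER = ["citation", "spatialDomain", "keywords"]


class ConceptStore:
    """Toy hybrid store: whole fragments for rebuilding XML, shredded rows for search."""

    def __init__(self, concept_order):
        # Map each concept name to its position in the global ordering.
        self.rank = {name: i for i, name in enumerate(concept_order)}
        self.fragments = {}  # object_id -> list of (rank, xml_fragment)
        self.index = []      # rows of (object_id, concept, element, value)

    def add_concept(self, object_id, xml_fragment):
        """Capture one concept at a time; concepts may arrive in any order."""
        elem = ET.fromstring(xml_fragment)
        if elem.tag not in self.rank:
            raise ValueError("unknown concept: " + elem.tag)
        # Keep the fragment intact so the document can be rebuilt without re-assembly.
        self.fragments.setdefault(object_id, []).append(
            (self.rank[elem.tag], xml_fragment)
        )
        # Shred non-empty element text into index rows for querying.
        for node in elem.iter():
            text = (node.text or "").strip()
            if text:
                self.index.append((object_id, elem.tag, node.tag, text))

    def search(self, concept, element, value):
        """Return the ids of objects with a matching element value in a concept."""
        return sorted(
            {oid for oid, c, e, v in self.index if (c, e, v) == (concept, element, value)}
        )

    def reconstruct(self, object_id, root_tag="metadata"):
        """Rebuild the XML document by emitting fragments in global concept order."""
        parts = sorted(self.fragments.get(object_id, []), key=lambda p: p[0])
        body = "".join(fragment for _, fragment in parts)
        return "<%s>%s</%s>" % (root_tag, body, root_tag)
```

In this sketch, concepts can be added incrementally and in any order as an experiment runs; `search` consults only the shredded rows, while `reconstruct` concatenates the stored fragments in the global concept order rather than re-assembling the document from individual values.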
This approach is validated through the XMC Cat metadata catalog, which uses a lightweight SOA-based architecture and can be deployed for varied scientific schemata through configuration instead of customized middleware. We present a prototype of the XMC Cat Builder, which guides the user through generating the necessary configuration from a domain XML schema using a point-and-click, web-based interface.
Keywords/Search Tags: Metadata, Scientific, XML