
The construction and usage of a microarray data warehousing system

Posted on: 2009-05-10
Degree: Ph.D
Type: Dissertation
University: University of California, Los Angeles
Candidate: Day, Allen Jason
Full Text: PDF
GTID: 1448390002496619
Subject: Bioinformatics
Abstract/Summary:
The Human Genome Project, which began in 1988 and was completed in 2001, ushered in a new era of biology. This effort accelerated the development of biochemical assay and information technologies. As a result, biologists are now able to ask questions that were previously considered intractable. One example of a breakthrough in assay technology is the DNA microarray, a high-throughput measurement device that enables individual scientists to rapidly and simultaneously interrogate the RNA concentration levels of virtually all genes in the human genome for a single biological source. As with all advances, the advent of the DNA microarray has created a new frontier of challenges. In this document, I describe an approach that addresses the problem of assembly, processing, and subsequent analysis of large volumes of data collected with DNA microarrays. My work is presented in 5 chapters and 3 appendices.

Chapter 1 serves as a general introduction to DNA microarray assay technology, the idiosyncrasies of using this technology in biological experiments, methods for preprocessing the resulting experimental data, and techniques used in the informatic systems that enable the processing, representation, storage, and subsequent retrieval of these data.

Chapter 2 is the core of the dissertation and describes the Celsius project, a microarray data warehousing system that is an implemented solution to the informatic problems described in Chapter 1. The completion of the Celsius project brought into existence the single largest publicly available source of primary and uniformly pre-processed DNA microarray data.

Chapter 3 builds upon Chapter 2 by describing an analysis of the data present in Celsius. Specifically, it describes the creation of gene-gene correlation matrices and their application in performing gene annotation and identifying disease genes within known linkage regions. While the idea of using gene-gene coexpression patterns is as old as DNA microarray technology itself, the scale of this analysis is unprecedented, and the demonstrated applicability of the correlation data to a broad set of biological questions raises concerns about the validity of current microarray data deposition systems, which rely heavily on experimental metadata.

Chapter 4 presents Biopackages.net, a technical subsystem of the data warehousing system described in Chapter 2. Reproducibility is a shared pillar of both scientific and data warehousing methods. Because Celsius depends heavily on computing systems to process the data stored in the warehouse, it was essential to have a mechanism for creating uniform and reproducible computing environments. This not only allows the system to scale as the volume of data inevitably increases, but also makes it possible to clone the system at other sites and to recover from failures.

Chapter 5 and Appendix A describe efforts in data modeling and dissemination. As we enter the post-genome era, new assay technologies continue to appear, and the growth in volume of existing and new data generated from each technology continues to accelerate. Thus, it is imperative that protocols be developed for the encoding and distribution of these data to both individual scientists and the information systems and agents acting on their behalf.

Appendix B and Appendix C present analyses performed on previous iterations of Celsius, which is described in Chapter 2. These early collaborations provided a glimpse of the utility of creating a microarray data warehouse, without which the work described here would never have been completed.
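As an illustration of the gene-gene correlation matrices mentioned for Chapter 3, the following is a minimal sketch and not the dissertation's actual method. It assumes Pearson correlation as the coexpression measure, and the gene names, expression values, and helper functions (`pearson`, `correlation_matrix`, `top_coexpressed`) are hypothetical, invented only for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (a constant profile would
    give a zero denominator).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_matrix(expression):
    """Build a gene-by-gene correlation matrix as nested dicts."""
    genes = sorted(expression)
    return {
        g: {h: pearson(expression[g], expression[h]) for h in genes}
        for g in genes
    }


def top_coexpressed(matrix, gene, k=1):
    """Return the k genes most positively correlated with `gene`."""
    others = [(h, r) for h, r in matrix[gene].items() if h != gene]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy data: each gene maps to its expression level
# across four arrays (samples).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

matrix = correlation_matrix(expression)
print(top_coexpressed(matrix, "geneA"))
```

In a coexpression-based annotation setting, a gene of unknown function might be tentatively associated with the annotations of its most strongly correlated neighbors; at warehouse scale this would require a vectorized numerical library rather than nested Python loops.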
Keywords/Search Tags:Data, System, New, Described