NGSdb: A NGS Data Management and Analysis Platform for Comparative Genomics

Posted on:2016-01-04

Degree:Master's

Type:Thesis

University:University of Washington

Candidate:Cobb, Marea

GTID:2478390017478063

Subject:Bioinformatics

Abstract/Summary:

As researchers continue to expand the volume of Next Generation Sequencing (NGS) data, the ability to store and query the data becomes increasingly important. The current approach of using spreadsheets has become too complex, and the data too vast, to efficiently store, view, cross-query, analyze, and share among collaborators. We have created and implemented a relational database schema, NGSdb (PostgreSQL), coupled with a user-friendly web interface (Django/Python), to address this growing need. NGSdb currently has five core components: a sample core, which tracks the sample information (e.g., organism, growth phase); a library core, which tracks the libraries constructed from samples (e.g., library type, sequencing method, raw data files); a genome core, which stores information about reference genomes; an analysis core, where the meta-information of bioinformatics analyses is stored; and a result core, where the results of the bioinformatics analyses are stored. I have expanded NGSdb by developing two analysis modules: a somy/CNV module and a SNP module. In addition to storing and retrieving the data, the web interface also serves as an analytical platform. The database is designed to be modular, allowing for future additions as new technologies or data become available. The modularity enables us to query across our different data types, such as SNP data and RNA-Seq data (e.g., how does the expression level change when a gene is mutated?). We demonstrate the capabilities of our system through two separate case studies. The first recapitulates a recently published genomic analysis of two Sri Lankan strains of Leishmania donovani, one causing visceral disease (VL) and one causing cutaneous disease (CL). The second case study compares the genome of a laboratory-adapted strain of L. donovani with genetically modified clones derived from it: single (sKO) and double (dKO) deletions of the dpkAR1 gene, and a derivative of the dKO line that had recovered the wild-type growth phenotype.
We identified single nucleotide polymorphisms (SNPs), copy number variations (CNVs), and somy differences between these lines to determine which genomic differences may contribute to the growth phenotype recovery of the double knockouts. NGSdb successfully reproduced the previously published analysis results and identified a potential artifact in the second study. Through these analyses we have also identified additions to NGSdb that we believe will further increase the usability of the system.
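
The five-core layout and the cross-data-type query described above can be illustrated with a minimal sketch. This is not the actual NGSdb schema: the table names, column names, and the helper functions `build_demo_db` and `snps_with_expression` are hypothetical, the rows are toy placeholder values rather than results from the thesis, and Python's built-in `sqlite3` module stands in for the PostgreSQL/Django stack so that the example runs without any setup.

```python
import sqlite3

# Hypothetical, simplified schema. All table and column names are
# illustrative assumptions; the real NGSdb uses PostgreSQL and Django.
SCHEMA = """
CREATE TABLE sample   (id INTEGER PRIMARY KEY, organism TEXT, growth_phase TEXT);
CREATE TABLE library  (id INTEGER PRIMARY KEY,
                       sample_id INTEGER REFERENCES sample(id),
                       library_type TEXT);
CREATE TABLE genome   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE analysis (id INTEGER PRIMARY KEY,
                       library_id INTEGER REFERENCES library(id),
                       genome_id INTEGER REFERENCES genome(id),
                       analysis_type TEXT);
-- Result core, split into one table per (assumed) analysis module.
CREATE TABLE snp_result (analysis_id INTEGER REFERENCES analysis(id),
                         gene TEXT, position INTEGER,
                         ref_base TEXT, alt_base TEXT);
CREATE TABLE expression_result (analysis_id INTEGER REFERENCES analysis(id),
                                gene TEXT, level REAL);
"""


def build_demo_db():
    """Create an in-memory database filled with toy placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO sample VALUES (1, 'Leishmania donovani', 'log')")
    conn.executemany("INSERT INTO library VALUES (?, ?, ?)",
                     [(1, 1, 'DNA-Seq'), (2, 1, 'RNA-Seq')])
    conn.execute("INSERT INTO genome VALUES (1, 'demo_reference')")
    conn.executemany("INSERT INTO analysis VALUES (?, ?, ?, ?)",
                     [(1, 1, 1, 'SNP'), (2, 2, 1, 'expression')])
    conn.execute("INSERT INTO snp_result VALUES (1, 'geneA', 100, 'A', 'G')")
    conn.executemany("INSERT INTO expression_result VALUES (?, ?, ?)",
                     [(2, 'geneA', 12.5), (2, 'geneB', 3.0)])
    conn.commit()
    return conn


def snps_with_expression(conn):
    """For each SNP, look up the expression level of the same gene
    measured in any library built from the same sample."""
    query = """
        SELECT s.id, snp.gene, snp.position, expr.level
        FROM snp_result AS snp
        JOIN analysis AS a1 ON a1.id = snp.analysis_id
        JOIN library  AS l1 ON l1.id = a1.library_id
        JOIN sample   AS s  ON s.id  = l1.sample_id
        JOIN library  AS l2 ON l2.sample_id = s.id
        JOIN analysis AS a2 ON a2.library_id = l2.id
        JOIN expression_result AS expr
             ON expr.analysis_id = a2.id AND expr.gene = snp.gene
        ORDER BY snp.gene
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in snps_with_expression(build_demo_db()):
        print(row)
```

Because every result row links back through an analysis and a library to a sample, a question such as "how does the expression level change when a gene is mutated?" reduces to a join across two result tables that share a sample. Under this assumed layout, a new analysis module would add a new result table without changing the other cores.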

Keywords/Search Tags:

Data, NGSdb

Related items

1	Design And Implementation Of The Inconsistent Data Repairing Subsystem In The Data Cleaning System
2	Seismic Achievement Data ETL Platform Architecture Design And Software System Implementation
3	The Research And Application Of Data Preprocessing In XML Data Warehouse
4	The Data Integration, Analysis And Utilization For Hospital Information Based On The Data Warehouse
5	Research On Related Issues Of Unstructured Data
6	Design And Implementation Of Data Mining Support Subsystem Based On Big Data Of Power
7	Design And Implementation Of Environmental Monitoring Data Management System
8	Study On Data Dependency_Based Data Quality Processing Techniques In Data Integration
9	Research On The Problems And Countermeasures Of Domestic Data Journalism Practice
10	Design And Implementation Of Data Service System Oriented To Consumer Finance