
Processing, Analysis And Modeling On High-Throughput Genomic Data

Posted on: 2013-02-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C M Wang
GTID: 1110330362458388
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
With the rapid development of the biological sciences, a large amount of data has been generated, and how to extract valuable knowledge from it has become a major topic in bioinformatics and computational biology research. This thesis focuses on the processing, analysis, and modeling of high-throughput genomic data. The following important findings have been made.

1. With the advent of DNA sequencing technologies, reference genome sequences are becoming available for more and more organisms, and analyzing sequence variation and understanding its biological importance has become a major research aim. However, storing and processing the huge amount of eukaryotic genome data, such as those of human, mouse and rice, has become a challenge for biologists. Currently available bioinformatics tools for compressing genome sequencing data have limitations, such as requiring a reference single nucleotide polymorphism (SNP) map and information on deletions and insertions. Here, we present GRS, a novel compression tool for storing and analyzing Genome ReSequencing data. GRS processes genome sequencing data without using reference SNPs or other sequence variation information, and automatically rebuilds the individual genome sequencing data using the reference genome sequence. When tested on the first Korean personal genome sequencing data set, GRS achieved 159-fold compression, reducing the size of the data from 2986.8 MB to 18.8 MB. When tested on sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB, and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.

2. ChIP-Seq, which combines chromatin immunoprecipitation (ChIP) with high-throughput massively parallel sequencing, is increasingly being used to identify protein-DNA interactions in vivo across the genome. However, to maximize the effectiveness of the analysis of such data, new algorithms that can accurately predict DNA-protein binding sites need to be developed. Here, we present SIPeS (Site Identification from Paired-end Sequencing), a novel algorithm for precise identification of binding sites from short reads generated by paired-end Solexa ChIP-Seq technology. We applied this method to ChIP-Seq data from the Arabidopsis basic helix-loop-helix transcription factor ABORTED MICROSPORES (AMS), which is expressed in the anther during pollen development. Our results show that SIPeS has better resolution for binding site identification than two existing ChIP-Seq peak detection algorithms, Cisgenome and MACS. Moreover, SIPeS is designed to accurately calculate the mappable genome length from the fragment lengths given by the paired-end reads. Dynamic baselines are also employed to discriminate closely adjacent binding sites, which is of particular value when working on genomes with high gene density. This de novo tool is available at http://gmdd.shgmo.org/Computational-Biology/ChIP-Seq/download/SIPeS, and the current version is 2.0.

3. Protein interactions are essential to the molecular processes occurring within an organism and are utilised in network biology to help organise and understand biological complexity. Currently, there are more than 10 publicly available Arabidopsis protein interaction databases. However, these databases have limitations, including different types of interaction evidence, a lack of defined standards for protein identifiers, and the use of other non-standard information. To effectively integrate the different datasets and maximise access to available data, we present an interactive bioinformatics web tool, ANAP (Arabidopsis Network Analysis Pipeline). ANAP has been developed for Arabidopsis protein interaction integration and network-based study, to facilitate functional protein network analysis. ANAP integrates 11 Arabidopsis protein interaction databases, comprising a total of 201,699 unique protein interaction pairs, 15,208 identifiers (including 11,931 TAIR AGI codes), 89 interaction detection methods, 73 species interacting with Arabidopsis and 6161 references. ANAP can be used as a knowledge base for constructing protein interaction networks from user input and supports both direct and indirect interaction analysis. It has an intuitive graphical interface allowing easy network visualisation and provides extensive detailed evidence for each interaction. In addition, ANAP displays gene and protein annotation in the generated interactive network, with links to TAIR, the AtGenExpress Visualization Tool (AVT), the Arabidopsis 1001 Genomes GBrowse (1001 Genomes), the Protein Knowledgebase (UniProtKB), the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Ensembl Genome Browser (EnsemblGenomes), to aid functional network analysis. The tool is available open access at http://gmdd.shgmo.org/Computational-Biology/ANAP/ANAP_V1.0.

4. Safety assessment of genetically modified (GM) crops is a key step between research on transgenic crops and their commercialization. Molecular characterization, including analysis of the integration site, flanking sequence, and copy number of the insertion, provides the most basic and important data for safety assessment. High-throughput analysis methods for molecular characterization of GM crops offer advantages over conventional methods such as Southern blotting, polymerase chain reaction (PCR), fluorescence in situ hybridization (FISH), and genome walking. In this work, we developed a high-throughput, accurate method based on paired-end sequencing to reveal the molecular features of GM rice at the genome-wide level. One transgenic rice event, T1C-19, was selected to test the applicability of the developed method. The integration sites of the two transgenes, on Chr04 and Chr11, were clearly revealed, and the sequences surrounding the integration sites were easily identified using conventional PCR and Sanger sequencing.
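
The reference-based idea behind the compression tool described in the first finding can be illustrated with a toy sketch. This is not the GRS algorithm, which the abstract does not specify: it only assumes that an individual genome can be stored as a list of differences from a reference and rebuilt from that reference later. Python's standard `difflib.SequenceMatcher` stands in for a real alignment step, the function names `encode` and `decode` are hypothetical, and the two short sequences are made up for illustration.

```python
from difflib import SequenceMatcher


def encode(reference: str, sample: str):
    """Return the edits that turn `reference` into `sample` (hypothetical helper)."""
    matcher = SequenceMatcher(None, reference, sample, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Store only the affected reference span and the sample bases that replace it.
            edits.append((i1, i2, sample[j1:j2]))
    return edits


def decode(reference: str, edits):
    """Rebuild the sample sequence from the reference plus the stored edits."""
    parts = []
    cursor = 0
    for start, end, bases in edits:
        parts.append(reference[cursor:start])  # unchanged stretch copied from the reference
        parts.append(bases)                    # substituted or inserted bases
        cursor = end                           # skip replaced or deleted reference bases
    parts.append(reference[cursor:])
    return "".join(parts)


# Made-up toy sequences; real genomes are millions of bases long.
reference = "ACGTACGTTAGCCGATAGGCTA"
sample = "ACGTACCTTAGCGATAGGGCTA"

edits = encode(reference, sample)
rebuilt = decode(reference, edits)

# Fold compression for the Korean genome, from the sizes reported in the abstract (MB).
korean_fold = 2986.8 / 18.8
```

Only the edit list needs to be stored alongside a shared reference, which is why closely related genomes can shrink so much; `korean_fold` evaluates to roughly 159, matching the figure quoted above.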
Keywords/Search Tags: resequencing data compression, Chromatin ImmunoPrecipitation with Sequencing (ChIP-Seq), protein interaction network, Genome ReSequencing (GRS), Site Identification from Paired-end Sequencing (SIPeS), Arabidopsis Network Analysis Pipeline (ANAP)