Bioinformatics Research For Omics Big Data

Posted on: 2017-04-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Yang
Full Text: PDF
GTID: 1220330488455792
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
With the rapid development of high-throughput technologies such as DNA sequencing and mass spectrometry, the life sciences are entering an era of big data characterized by massive, multi-dimensional omics data. These data create opportunities for a better understanding of human gene function and pathological mechanisms, and for precision medicine. Bioinformatics research on omics big data promotes the efficient mining of biological knowledge from these data. Big data can be described by the characteristics of “Volume”, “Variety”, and “Velocity”, each of which brings new challenges to bioinformatics. For data computing, the workload is challenging for laboratories that lack computing resources. For data analysis, answering biological questions calls for the integration of multiple omics data types. The lack of bioinformatics tools is the major bottleneck in the era of big data. This thesis aims to address these challenges by applying data computing and data analysis technologies to omics big data.
In terms of data computing technology, we focus on using cloud computing to solve several big data problems in proteomics. Amazon Web Services (AWS) offers a pay-as-you-go pricing model and provides cloud services such as Elastic Compute Cloud (EC2) and Simple Storage Service (S3). AWS delivers IT resources to users over the Internet and allows users to scale instantly to meet the needs of omics big data computing. The MapReduce framework is composed of Map and Reduce procedures. The Map procedure divides the input data into splits, assigns each split to a Map function as key/value pairs, and stores the output on the corresponding computing nodes as key/value pairs. The Reduce procedure iterates through the keys and produces the final result. The MapReduce framework thereby simplifies the programming model of distributed data computing.
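The Map and Reduce procedures described above can be illustrated with a minimal single-process sketch. This is not the CAPER 3 implementation: the function names, the toy peptide masses, and the mass-tolerance matching rule are all assumptions chosen only to show how splits, key/value pairs, and per-key reduction fit together.

```python
from collections import defaultdict
from itertools import islice


def split_input(records, split_size):
    """Divide the input records into fixed-size splits."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, split_size))
        if not chunk:
            return
        yield chunk


def map_split(split, spectra, tolerance):
    """Map: emit (spectrum_id, peptide) for every peptide in the split
    whose mass lies within the tolerance of the spectrum's precursor mass."""
    for peptide, mass in split:
        for spectrum_id, precursor_mass in spectra.items():
            if abs(mass - precursor_mass) <= tolerance:
                yield spectrum_id, peptide


def group_by_key(pairs):
    """Collect all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_key(key, values):
    """Reduce: produce the final result for one key (a sorted candidate list)."""
    return key, sorted(values)


def run_job(peptides, spectra, tolerance=0.5, split_size=2):
    """Run the Map step over every split, then the Reduce step over every key."""
    pairs = []
    for split in split_input(peptides, split_size):
        pairs.extend(map_split(split, spectra, tolerance))
    grouped = group_by_key(pairs)
    return dict(reduce_key(key, values) for key, values in sorted(grouped.items()))


# Toy data (hypothetical masses, for illustration only).
peptides = [("PEPA", 100.0), ("PEPB", 100.3), ("PEPC", 200.0), ("PEPD", 150.0)]
spectra = {"s1": 100.1, "s2": 200.2}
candidates = run_job(peptides, spectra)
```

In a real cluster each split would be processed on a different node and the grouping step would be handled by the framework; here everything runs in one process so the data flow is easy to follow.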
To address the increasing computational complexity of MS/MS data, we use cloud computing to solve big data problems in proteomics, such as the identification of novel peptides, single amino acid variant peptides, and exon-skipping splicing peptides. We use a proteogenomic approach to analyze MS/MS data, and construct reference amino acid sequence databases from the six-frame translation of the entire human genome, from missense mutations, and from exon-skipping splicing events. We use a MapReduce-based database search algorithm to accelerate MS/MS-based peptide identification, and use the target-decoy strategy to control the quality of peptide-level identifications. We have developed CAPER 3, a scalable cloud-based system for data-intensive analysis of MS/MS-based proteomics data sets. CAPER 3 is built on cloud computing technologies such as AWS and MapReduce, and is composed of two core packages: a remote work package (RWP) and a local work package (LWP). The RWP handles tasks running in the cloud and is deployed to AWS EC2. The LWP is a Java-based client that handles local operations. The client has four main components: data transfer, job configuration, progress tracking, and result visualization. It allows users without expertise in cloud computing to perform scalable data computing tasks. CAPER 3 offers bioinformatics solutions to four data-intensive problems: novel peptide detection, known single amino acid variant identification, sample-specific single amino acid variant identification, and exon-skipping event identification. CAPER 3 accelerates the analysis of these data-intensive problems, and the results provide protein-level evidence for novel genes or variants. We hope that CAPER 3 will facilitate big data processing in proteomics.
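The target-decoy strategy mentioned above can be sketched as follows. The abstract does not state which estimator CAPER 3 uses, so this example assumes the common decoy/target ratio above a score threshold; the function names and the toy scores are hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Estimate the false discovery rate among matches scoring >= threshold.

    psms is a list of (score, is_decoy) pairs. The estimate is the number of
    decoy matches divided by the number of target matches (an assumed, commonly
    used form of the target-decoy estimate).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        # No target matches pass: treat any surviving decoy as a 100% FDR.
        return 1.0 if decoys else 0.0
    return decoys / targets


def score_cutoff(psms, max_fdr=0.01):
    """Return the lowest observed score whose estimated FDR is <= max_fdr,
    or None if no threshold qualifies."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Toy matches: (score, is_decoy). Higher scores are better.
psms = [(10, False), (9, False), (8, True), (7, False), (6, False)]
cutoff_strict = score_cutoff(psms, max_fdr=0.01)
cutoff_loose = score_cutoff(psms, max_fdr=0.3)
```

Matches scoring at or above the returned cutoff would be kept as identifications; a stricter `max_fdr` gives a higher cutoff and fewer accepted peptides.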
The source code of CAPER 3 is available at https://github.com/ys-amms/CaperCloud, and the manual is available at http://prodigy.bprc.ac.cn/caper3.
In terms of data mining technology, we focus on establishing a network-based system for omics data integration. Analysis of a single omics data type gives only a limited understanding of the biological system. Omics big data generated from multiple high-throughput sequencing platforms reveal static and dynamic information about the molecules in the cell, and these molecules communicate with each other to form biological networks. Taking into account the differences and relations between different types of omics data can facilitate the prioritization of disease genes. Identifying cancer driver genes is a key challenge in bioinformatics. We collected cancer-related mutations and found a significant difference between cancer mutations and neutral mutations in PAM250-based and Shannon entropy-based sequence features. For network-level information, we proposed a pathway-based algorithm to identify driver genes. The pathway-based approach assumes that candidate driver mutations constantly activate a pathway branch, leading to sustained downstream gene expression that may lead to cancer; the PFIN-based approach assumes that candidate driver mutations contribute to cell state changes, measured by differentially expressed genes, that can be explained using the network information. We have developed an R package called “Bionexr” for integrative network-based analysis of gene somatic mutation and gene expression data to identify cancer drivers. Bionexr consists of four modules: “Data Download”, “Gene Analysis”, “Network Analysis”, and “Visualization”. Firstly, the “Data Download” module downloads level-3 gene somatic mutation and expression data from TCGA. This module also supports progress feedback and resuming interrupted transfers.
Secondly, the “Gene Analysis” module handles both gene mutation and gene expression data. This module predicts the functional impact of each amino acid residue mutation based on conservation patterns, and calculates the gene expression fold change between normal and tumor samples. The “Network Analysis” module contains the pathway-based and PFIN-based approaches for identifying cancer drivers. Finally, the “Visualization” module displays the result in a network view. In particular, the result of the pathway-based analysis is displayed as a three-level hierarchical network (mutated gene level, transcription factor level, and target gene level), and the result of the PFIN-based analysis is displayed as an undirected network. The combination of the “Gene Analysis” and “Network Analysis” modules allows users to consider both nucleotide-level and network-level information when identifying cancer drivers. To test the validity of Bionexr, we analyzed several TCGA cancers, including head and neck squamous cell carcinoma (HNSC), breast invasive carcinoma (BRCA), kidney renal clear cell carcinoma (KIRC), and uterine corpus endometrial carcinoma (UCEC), and identified potential cancer genes or pathways. We expect that Bionexr will facilitate cancer data analysis by generating easily interpreted results. Bionexr is available at https://github.com/ys-amms/bionexr.
The main challenge of omics big data mining in bioinformatics is how to efficiently extract biological knowledge from huge volumes of omics data. Cloud computing technology and biological networks can help address this challenge. In the future, we expect that the integrative analysis of omics big data and the literature will shed light on omics big data mining research.
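The fold-change calculation attributed to the “Gene Analysis” module can be illustrated with a short sketch. Bionexr itself is an R package and its exact formula is not given here, so the Python below is only an assumed form: a log2 ratio of mean tumor to mean normal expression, with a pseudocount added to avoid division by zero. The function names, the pseudocount, and the cutoff of 1.0 are hypothetical.

```python
import math
from statistics import mean


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """Log2 ratio of mean tumor expression to mean normal expression.

    The pseudocount is an assumption of this sketch; it keeps the ratio
    defined when a gene has zero expression in one group.
    """
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))


def differentially_expressed(expression, min_abs_log2fc=1.0):
    """Return the genes whose absolute log2 fold change meets the cutoff.

    expression maps a gene name to a (tumor_values, normal_values) pair.
    """
    selected = {}
    for gene, (tumor, normal) in expression.items():
        change = log2_fold_change(tumor, normal)
        if abs(change) >= min_abs_log2fc:
            selected[gene] = change
    return selected


# Toy expression values (hypothetical genes and numbers).
expression = {
    "GENE_UP": ([7.0, 7.0], [1.0, 1.0]),
    "GENE_DOWN": ([1.0, 1.0], [7.0, 7.0]),
    "GENE_FLAT": ([3.0, 3.0], [3.0, 3.0]),
}
hits = differentially_expressed(expression)
```

Genes passing such a filter are the kind of differentially expressed genes that a network-level analysis could then try to connect back to candidate driver mutations.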
Keywords/Search Tags: Bioinformatics, Cloud computing, Proteomics, Omics data integration, Biological network