Identification And Annotation Of Genome-wide Functional Elements Based On Deep Learning

Posted on:2017-05-14

Degree:Doctor

Type:Dissertation

Country:China

Candidate:F Liu

GTID:1220330488955770

Subject:Biochemistry and Molecular Biology

Abstract/Summary:

The advent of next-generation sequencing (NGS) technology has made it possible to obtain genome-wide, high-throughput sequencing data more quickly and cheaply. This has changed the methods of fundamental, applied and clinical research, deepened the understanding of complex biological phenomena and their mechanisms, and greatly accelerated the progress of multi-omics.

The Encyclopedia of DNA Elements (ENCODE) project and the Roadmap Epigenomics Project are the two most important research projects of the post-genomic era. They provide a large amount of genome-wide, high-throughput, multi-omics data generated by ChIP-Seq, RNA-Seq, DNase-Seq and other sequencing methods. The different omics are interrelated rather than isolated and can influence each other, and different omics data reflect different aspects of the genome. It is therefore necessary to combine multiple omics data from a systematic, integrated perspective in order to exploit the technical differences and complementarity of the various data types, which helps to solve biological problems at a systems level and to reveal the mechanisms behind biological phenomena.

Identifying different types of functional genomic elements requires substantial biological background knowledge about the problem at hand. Based on this prior knowledge, researchers have to design a series of operations, such as filtering, merging and overlapping, to identify the corresponding functional elements, and the outcome depends heavily on how much is already known about the problem. Researchers have also developed a number of bioinformatics algorithms and software tools to identify functional elements, with some success. However, these algorithms and tools essentially belong to shallow learning, and their capacity for data characterization and feature learning is limited. Their ability to integrate massive, complex, multi-omics data and to find the regularities in it is therefore greatly restricted.

In 2006, Geoffrey Hinton et al. published a groundbreaking article in Science, which set off a wave of deep learning research. Deep learning is an extension of artificial neural network (ANN) research. The deep neural network (DNN) built by deep learning has an excellent capacity for feature learning. Through layer-wise extraction and abstraction of features, a DNN can characterize data and learn statistical regularities from massive numbers of training samples, and can thus make more accurate predictions on new, unseen data.

In this dissertation, according to the specific biological problems and their corresponding data types, and building on the theoretical foundation and practical experience of deep learning and other machine learning algorithms, we first designed and developed deep learning algorithms suited to different biological problems. We then used the large volume of genome-wide, high-throughput NGS data provided by the ENCODE project and the Roadmap Epigenomics Project to identify distinct functional elements at the genome scale with these deep learning models, from a systematic, integrated perspective. Finally, we characterized various biological properties of these functional elements, including histone modifications, gene expression, transcription factor binding sites (TFBS), DNase I hypersensitive sites (DHS), DNA methylation, evolutionary conservation, chromatin 3D structure and RNA secondary structure, and revealed their relationship with diseases.

According to the different biological problems, we completed the following studies:

The first study is “The identification and annotation of replication-timing domains in the human genome by deep learning”. This study begins with the replication-timing domains of DNA replication. To identify the different replication-timing domains, we developed a novel hybrid architecture combining a pre-trained deep neural network and a hidden Markov model (DNN-HMM) for the de novo identification of replication domains from replication timing profiles. In performance assessment and comparison, DNN-HMM significantly outperformed a traditional DNN, Gaussian mixture model-hidden Markov model (GMM-HMM) systems and six other reported methods that can be applied to this problem. We applied the trained DNN-HMM to identify distinct replication domain types using newly replicated DNA sequencing (Repli-Seq) data across 15 human cell types. A subsequent integrative analysis revealed that these replication domains harbour unique genomic and epigenetic patterns, transcriptional activity and higher-order chromosomal structure. Based on these findings, we proposed the “replication-domain” model. This model reveals an important chromatin organizational principle of the human genome and represents a critical step toward understanding the mechanisms that regulate replication timing.

The second study is “Prediction of human enhancers with a deep learning-based algorithmic framework”. Enhancers play a central role in the spatiotemporal regulation of gene expression, but identifying enhancers at the genome scale remains a challenge in computational biology. In this study, we first introduced a deep learning-based algorithmic framework named PEDLA for comprehensive and unbiased enhancer prediction. We demonstrated that PEDLA is capable of integrating massively heterogeneous data, which makes its predictions more comprehensive and accurate, and that it can handle class-imbalanced data, which makes its predictions unbiased and robust. PEDLA significantly outperformed five state-of-the-art machine learning methods. Based on these results, we further extended the PEDLA framework to predict enhancers across multiple human cell types and tissues. We trained PEDLA on 22 training cell lines/tissues and achieved excellent performance both on these 22 and on another 20 independent test cell lines/tissues, which demonstrates that PEDLA is a general and robust deep learning framework for enhancer prediction across diverse cell types and tissues.

The third study is “A deep learning and ensemble learning-based algorithm for identification of RNA editing sites”. Current methods for identifying RNA editing sites rely mainly on prior knowledge of such sites and obtain them through a series of complex manual filtering steps. In this study, we designed and developed a deep learning-based, bootstrapped and parallelized ensemble learning algorithm named DeepRed for the identification of RNA editing sites. DeepRed has several advantages. First, it identifies RNA editing sites by automatically extracting and learning features from the training samples. Second, it can accurately identify RNA editing sites in the candidate set, which is the direct output of the GATK software and consists of various types of sites. Third, it can identify RNA editing sites and SNPs at the same time. Fourth, its input feature is the raw “ATCG” sequence, from which it automatically extracts and abstracts more effective features. Fifth, it can handle class-imbalanced data without bias. Our results demonstrated that DeepRed achieved excellent performance in the identification of RNA editing sites, and validation on independent experimental data also indicated that the algorithm was reliable and accurate. In addition, assessment in multiple cell types indicated that DeepRed generalizes well and can therefore be used to identify RNA editing sites in different cells, different locations and different conditions.

The last study is “The identification and annotation of human enhancer RNA”. Whether enhancer RNAs (eRNAs) merely represent transcriptional noise or carry biological functions, and whether the functionality is conveyed by the act of transcription or by the eRNA transcripts themselves, is still open for debate. In this study, we identified active enhancers that transcribe eRNA across 50 human cell types and tissues. We characterized various chromatin signatures in active enhancers, including histone modifications and the binding sites of transcription factors (TFs) and coactivators. We found that the activity of enhancers, the level of eRNAs, the mRNA level of the associated genes and the Gene Ontology (GO) biological processes were correlated with each other in a cell-type-specific manner, and that these cell-type-specific GO biological processes clearly defined the identities of the corresponding cell types and tissues. Furthermore, we searched for known and novel RNA secondary structures within eRNAs and found a number of functional structural ncRNAs in eRNAs, including consensus secondary structures similar to miRNA. Further analysis revealed that single nucleotide polymorphisms (SNPs) falling in eRNA regions have significant effects on eRNA structure and are associated with human diseases, potentially offering effective diagnostic and therapeutic targets.

In summary, for “Identification and Annotation of Genome-wide Functional Elements Based on Deep Learning”, we designed and developed various deep learning algorithms suited to the identification of different functional elements. We then annotated the identified functional elements from a systematic, integrated perspective, explored the underlying regulatory mechanisms and revealed their relationship with diseases.
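
As an illustration of the hybrid DNN-HMM idea used for replication domains, the sketch below shows how per-bin class posteriors from a classifier can be smoothed into contiguous domains by Viterbi decoding. It is a minimal sketch of the general hybrid technique, not the dissertation's implementation: the abstract does not give the number of states, the transition probabilities or the decoding details, so the two states (0 = "early", 1 = "late"), the toy posteriors and the function names `scaled_log_likelihoods` and `viterbi` are all assumptions made for this example.

```python
import math


def scaled_log_likelihoods(posteriors, priors):
    """Turn per-bin posteriors p(state | x) into scaled log-likelihoods.

    Dividing by the state prior gives a score proportional to p(x | state),
    which is the emission term an HMM expects. All values must be > 0.
    """
    return [
        [math.log(p) - math.log(q) for p, q in zip(row, priors)]
        for row in posteriors
    ]


def viterbi(log_emit, log_trans, log_start):
    """Return the most likely state sequence for a chain of genomic bins."""
    n_states = len(log_start)
    score = [log_start[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best previous state for arriving in state s at bin t.
            prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(prev)
            new_score.append(score[prev] + log_trans[prev][s] + log_emit[t][s])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical classifier outputs for 7 consecutive bins: [p(early), p(late)].
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1],
              [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]]
priors = [0.5, 0.5]  # assumed uniform state priors
# Assumed "sticky" transitions: staying in a domain is 9x likelier than switching.
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
log_start = [math.log(0.5), math.log(0.5)]

domains = viterbi(scaled_log_likelihoods(posteriors, priors), log_trans, log_start)
print(domains)  # [0, 0, 0, 0, 1, 1, 1]
```

In this toy input, taking the most probable state in each bin independently would label the third bin as "late" and split the first domain. The transition penalty in the HMM absorbs that isolated bin, so the decoded path consists of two contiguous domains.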

Keywords/Search Tags:

machine learning, deep learning, replication-timing domain, enhancer, RNA editing site, eRNA

Related items

1	Interpretable Enhancer Prediction Based On Deep Learning And XGBoost
2	Deep Learning Based Enhancer Regulatory Sequence Recognition Research
3	Prediction Of Enhancers And N4 Methylation Sites Based On Ensemble Learning And Deep Learning
4	Cis-regulatory Element Identification Based On Deep Learning And Ensemble Learning
5	Research And Implementation Of Deep Learning-based Prediction Of Super-enhancer-promoter Relationship
6	Research On RNA Related Function Sites Based On Machine Learning
7	Research On Word Embedding And Deep Learning Based Replication Origin And Enhancer Prediction
8	Application Of Machine Learning In Space Environment Feature Recognition And Analysis
9	Deep Learning-based Approach To Identify Enhancer-promoter Interactions
10	Research On RNA Editing Site Identification Algorithm Based On Deep Learning