Deep Learning Based Enhancer Regulatory Sequence Recognition Research

Posted on: 2018-01-17  Degree: Master  Type: Thesis
Country: China  Candidate: B T Yang  Full Text: PDF
GTID: 2350330518465284  Subject: Bioinformatics
Abstract/Summary:
Next-generation sequencing (NGS) has brought a major innovation to the field of sequencing: researchers across biology can now obtain high-throughput sequence data quickly and cheaply. This progress has fundamentally changed how basic and clinical research is carried out. At the same time, the resulting mass of data keeps driving new storage and computational methods. Research that once centered on biochemical experiments has gradually shifted toward the analysis of experimental data. Single-omics and multi-omics analyses, which were previously out of reach because they require large amounts of data, have now become possible, and they accelerate our understanding of the mechanisms behind complex life phenomena. The explosive growth of data has made researchers realize that new forms of knowledge are needed to help future generations understand current research progress, and that uncovering the deeper meaning of the data requires mining the accumulated data repeatedly.

The Human Genome Project (HGP), which was dedicated to reading the entire human genome, is an important milestone in biology. The goal, however, is not only to read the sequence but, more importantly, to understand the functions encoded in the DNA. Subsequently, the Roadmap Epigenomics Project and the Encyclopedia of DNA Elements (ENCODE) Project became two important efforts to explore these genetic mysteries. Both projects collected large amounts of multi-omics data from DNase-Seq, RNA-Seq, ChIP-Seq and other experiments. A single-omics study rarely stands alone: it reflects only individual aspects of the genome, whereas the different omics layers are closely related. Combining different data sets from a systematic, holistic perspective has therefore become one of the most important research approaches in the field of
bioinformatics.

Over the past 40 years, research has shown that DNA contains a series of cis-regulatory sequences. Mutations that occur within these regulatory elements can lead to differences in the final phenotype. Cis-regulatory elements are key to activating and maintaining transcription, so an in-depth understanding of them is important for understanding the mechanisms of life, the causes of human disease, and the conservation of species. Enhancers are distal cis-acting DNA regulatory elements that play key roles in gene expression in a time- or cell-line-specific manner. Understanding the properties, genomic targets and regulatory activities of enhancers is currently an area of great interest, given the increasing appreciation of their importance in development, cell identity (Whyte et al., 2013), phenotypic diversity, evolution and human disease. Given the absence of common sequence features, the distal location of enhancers from their regulated targets, and their high cell-type and tissue specificity, the accurate identification of enhancers remains a significant challenge in the annotation of mammalian genomes.

In recent years, the advent of deep sequencing has enabled the development of a large variety of computational methods for enhancer identification. These methods integrate different types of data derived from different sources. Based on the available data sources, enhancer identification methods can be grouped into three conceptually simple categories, although individual methods often integrate different data sets or features and may combine supervised and unsupervised components. The first category includes bioinformatics approaches that identify enhancers using epigenetic profiles, such as histone marks derived from ChIP-seq, DNase I hypersensitivity sites (DHSs) and/or transcription factor-binding sites (TFBSs), mainly through clustering and
unsupervised learning techniques. The second category reformulates enhancer identification as a binary classification task, discriminating enhancer regions from non-enhancer regions (the negative set) using supervised machine learning techniques such as support vector machines (SVMs), artificial neural networks (ANNs), decision trees (DTs), random forests (RFs), probabilistic graphical models (PGMs) and, more recently, deep learning. The third category comprises a variety of bioinformatics methods based on high-resolution data derived from enhancer testing and screening methods, used to detect and test enhancers in human, mouse, flies and yeast. However, despite major efforts to develop accurate enhancer prediction methods, these methods still encounter numerous issues beyond purely technical ones, such as class imbalance, over-fitting, the tuning of model parameters, and poor generalization ability. One major obstacle is the lack of a large, sufficiently comprehensive and experimentally validated enhancer set for humans or other species. There is therefore an urgent need for computational methods that work from a limited set of experimentally validated enhancers and that decipher the transcriptional regulatory code encoded in enhancer sequences.

In 2006, Geoffrey Hinton introduced the concept of deep learning. In 2012, the convolutional neural network model from Hinton's team decisively won the ImageNet image recognition competition, and in 2016 the AlphaGo program defeated a top human Go player. Together, these three events set off a global boom in artificial intelligence research. Thanks to the recent development of high-performance CPUs, GPUs, FPGAs and other computing hardware, the heavy computation that deep learning requires has become tractable. At the same time, because deep learning can extract abstract features at multiple levels and has a powerful capacity for feature learning, and given today's massive research data, its
performance has gone far beyond that of traditional machine learning algorithms. Deep learning has been widely used in many fields, such as image recognition, natural language processing, speech recognition and quantitative trading. It has also broadened the range of methods available to biomedical research: in recent years, deep learning methods have achieved good, published results on problems such as medical image processing, drug target screening, and the assessment of gene mutation sites.

In this study, we analyze the research status of cis-regulatory elements in detail, focusing on the various research methods related to enhancer regulatory elements. We then describe the general approach to solving problems with machine learning and with deep learning, as well as the differences between them. By analyzing the existing machine learning and deep learning methods for identifying enhancer regulatory elements, we find problems such as low accuracy, poor generalization ability and limited data sources. We therefore developed a deep-learning-based hybrid architecture, named BiRen, that integrates the sequence encoding and representation power of a convolutional neural network (CNN) with the superior capacity of a gated recurrent unit (GRU)-based bidirectional recurrent neural network (BRNN) for handling long-term dependencies in long DNA sequences, in order to accurately identify enhancers using the DNA sequence alone. BiRen was trained with a limited set of experimentally validated enhancer elements derived from the VISTA Enhancer Browser, which exhibit gene enhancer activity as assessed in transgenic mice. We demonstrate that BiRen learns the regulatory code directly from genomic sequences and shows superior identification accuracy, robustness to noisy data, and generalization to other species in enhancer prediction, relative to two state-of-the-art methods based on sequence characteristics such as motifs or k-mers. Our
BiRen will provide researchers with a deeper understanding of the regulatory code of enhancer sequences.
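
To make the hybrid design described above more concrete, the following is a minimal, untrained sketch of a forward pass through a convolutional layer followed by a GRU-based bidirectional recurrent layer. It is not BiRen's actual implementation. The abstract does not give layer sizes, the input encoding, or the training procedure, so the one-hot encoding, the single sigmoid output, the random weights, the omission of bias terms, and all names (`one_hot`, `conv1d_relu`, `gru_last_state`, `enhancer_score`) are illustrative assumptions.

```python
import math
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode DNA as rows of 4 values; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]

def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these from data."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def vec_mat(v, m):
    """Row vector v (length n) times matrix m (n x k) -> list of length k."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def conv1d_relu(x, kernel_w, width):
    """Valid 1-D convolution plus ReLU (biases omitted for brevity).

    Each window of `width` one-hot rows is flattened and multiplied by
    kernel_w, which has shape (width * 4) x filters.
    """
    out = []
    for i in range(len(x) - width + 1):
        window = [v for row in x[i:i + width] for v in row]
        out.append([max(0.0, a) for a in vec_mat(window, kernel_w)])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_params(in_dim, hidden, rng):
    """Input (W) and recurrent (U) weights for the z, r and candidate gates."""
    p = {}
    for gate in "zrh":
        p["W" + gate] = rand_matrix(in_dim, hidden, rng)
        p["U" + gate] = rand_matrix(hidden, hidden, rng)
    return p

def gru_last_state(xs, p, hidden):
    """Run a GRU over the sequence xs and return the final hidden state."""
    h = [0.0] * hidden
    for x in xs:
        z = [sigmoid(a + b) for a, b in zip(vec_mat(x, p["Wz"]), vec_mat(h, p["Uz"]))]
        r = [sigmoid(a + b) for a, b in zip(vec_mat(x, p["Wr"]), vec_mat(h, p["Ur"]))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b) for a, b in zip(vec_mat(x, p["Wh"]), vec_mat(rh, p["Uh"]))]
        # One common convention for the update; some libraries swap z and (1 - z).
        h = [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]
    return h

def enhancer_score(seq, filters=8, width=5, hidden=6, seed=0):
    """Untrained CNN -> bidirectional GRU -> sigmoid score in (0, 1)."""
    rng = random.Random(seed)
    kernel_w = rand_matrix(width * 4, filters, rng)
    fwd = gru_params(filters, hidden, rng)
    bwd = gru_params(filters, hidden, rng)
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden)]
    feats = conv1d_relu(one_hot(seq), kernel_w, width)
    # Bidirectional: read the feature sequence left-to-right and right-to-left,
    # then concatenate the two final hidden states.
    state = gru_last_state(feats, fwd, hidden) + gru_last_state(feats[::-1], bwd, hidden)
    return sigmoid(sum(w * s for w, s in zip(w_out, state)))

score = enhancer_score("ACGTTGCAAGGCTTAACGTNACGT")
print(score)
```

Because the weights are random, the printed score carries no biological meaning; the sketch only shows how sequence-derived convolutional features can feed a recurrent layer that reads the sequence in both directions. In practice such a model would be built and trained with a deep learning framework rather than hand-written loops.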
Keywords/Search Tags: cis-regulatory elements, enhancer, machine learning, deep learning