
Applications Of Machine Learning In Biological Sequence

Posted on: 2022-08-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Yang
GTID: 1480306332462274
Subject: Computer application technology
Abstract/Summary:
With the advance of genome projects and the spread of high-throughput sequencing technologies, massive amounts of gene sequence data and RNA sequence data have been acquired. These biological sequence data contain a wealth of biological pattern information. Gene sequences record most of the genetic information of a species, coding RNA affects cell function through the protein it is translated into, and non-coding RNA affects cell function by participating in intracellular regulatory processes. This information provides abundant knowledge for understanding life at the molecular level. In recent years, with the rapid development of computer technology, machine learning, deep learning and artificial intelligence have been widely used in many fields. Given the massive volume of biological sequence data, using these techniques to explore its underlying biological patterns is significant for modern medicine and biology. The core contents of this dissertation are: (1) efficiently representing different types of biological sequence data as numeric vectors; (2) using deep learning networks to build efficient biological sequence classification models. Three real applications in the genome and transcriptome are used to show the strategies of machine learning in different biological sequence problems.

The first study shows how to use graph embedding and deep learning methods to identify essential genes more accurately from gene sequence data. Essential genes are the minimum gene set that maintains cell survival, development and proliferation. As studies on essential genes have accumulated, many essential genes have been reported by different cell experiments. Because the conditions of these experiments differ, the essential genes they report do not fully coincide. Identifying essential genes more accurately with machine learning methods is therefore significant for the discovery of genes related to diseases and cancers. In Chapter 3, a deep learning model called EGNet (Essential Gene Network) is proposed, which uses a graph embedding method to represent the graph constructed from gene sequences and combines convolutional neural networks and fully connected networks to identify essential genes. EGNet provides a new reference for using machine learning methods to study gene sequence-related problems.

The second study shows how to use sequence representation and deep residual network methods to identify short-ORF (Open Reading Frame) non-coding RNA from RNA sequences. Although non-coding RNA cannot encode proteins, it can participate in intracellular regulation and affect cell function. Existing methods can accurately identify long-ORF non-coding RNAs, but they are not suitable for short-ORF non-coding RNAs. Therefore, in Chapter 4, a deep residual network model called NCResNet (Non-Coding Residual Network) is proposed. NCResNet represents RNA sequences at four levels (DNA sequence, protein properties, RNA biochemical properties, and RNA structure information) and combines them with deep residual networks to identify non-coding RNA. The experimental results show that NCResNet is suitable for recognizing not only short-ORF non-coding RNAs but also long-ORF non-coding RNAs. NCResNet provides a new reference for machine learning in RNA sequence-related research.

The third study shows how to use cascaded feature learning and a dual-channel convolutional neural network to predict the interaction between non-coding RNAs from RNA sequences. lncRNA (long non-coding RNA) can competitively bind with miRNA (microRNA), which affects the regulation between miRNA and genes. This regulatory pattern often appears in diseases and cancers, so accurately predicting the interaction between lncRNA and miRNA is significant for studying internal non-coding RNA regulation in diseases and cancers. Therefore, in Chapter 5, a prediction model called LncMirNet (LncRNA miRNA interaction Network) is proposed to predict the interaction between lncRNA and miRNA. Cascaded feature learning, multi-feature fusion and a dual-channel convolutional neural network are used to distill informative features for the subsequent prediction. Compared with existing methods, LncMirNet achieves better overall performance on public datasets. LncMirNet provides a new reference for studying the interaction between RNAs and a new solution for machine learning in sequence interaction problems.

The three models proposed in this dissertation correspond to three types of biological sequence data, namely gene sequence, RNA sequence, and interaction sequence data. The main contributions are: (1) representing the gene sequence by a graph embedding method; (2) representing the RNA sequence through multi-level attributes guided by the central dogma; (3) representing the interaction structure data by cascaded feature learning and feature space learning methods; (4) showing the different strategies of machine learning in different types of biological sequence problems. The research in this dissertation is cutting-edge, theoretical and constructive. The studies build on one another and are mutually supportive, providing a good technical reserve and guidance for the application of machine learning to biological sequences.
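
As a rough illustration of the first core task named in the abstract (turning a biological sequence into a numeric vector), the Python sketch below computes normalized k-mer frequencies and the length of the longest forward-strand ORF. It is not the representation used by the models described above, which the abstract does not specify in detail. The function names `kmer_frequencies` and `longest_orf_length`, the default k = 3, and the simple ATG-to-stop-codon ORF definition are all assumptions made for this example.

```python
from itertools import product

# Standard stop codons, written in the DNA alphabet (assumption: U is mapped to T).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Return a fixed-length list of normalized k-mer frequencies (hypothetical helper)."""
    seq = seq.upper().replace("U", "T")
    # Enumerate all k-mers in a fixed order so every sequence maps to the same layout.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows with ambiguous bases such as N
            counts[j] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG..stop ORF on the forward strand.

    The stop codon is included in the length; ORFs without a stop are ignored.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


if __name__ == "__main__":
    rna = "CCAUGAAAUGA"
    features = kmer_frequencies(rna, k=2) + [longest_orf_length(rna)]
    print(len(features), features[-1])
```

A vector of this kind could be passed to any downstream classifier. ORF length is included because it is a commonly used signal for separating coding from non-coding transcripts, which suggests why short-ORF cases are harder to classify.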
Keywords/Search Tags: Machine Learning, Deep Learning, Essential Genes, Noncoding RNA, Interaction Between Noncoding RNAs