
Research On Identifying Specific Gene Sequence And Its Association Based On Deep Learning

Posted on: 2022-11-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Q Zhang
GTID: 1480306758979199
Subject: Computer application technology
Abstract/Summary:
Genes record the genetic information of the human body, control all stages of human growth, and are a research hotspot in bioinformatics. The Human Genome Project set out to sequence the human genome, which helps researchers reveal the function of genes. Humans have more than 20,000 genes, which form complex gene networks and cooperate to control the activities of the body. Gene activity is influenced by many regulatory factors, among which transcription factors (TFs) are crucial. TFs are proteins that bind to specific DNA sequences; the sequences they bind are called transcription factor binding sites (TFBSs), and TFs regulate the expression of downstream genes via TFBSs. The aligned TFBSs of the same TF are often conserved at the sequence level, and this conserved pattern is called a cis-regulatory motif (motif). Since gene expression is regulated by TFs, studying these specific gene sequences is important for understanding human life activities and for treating disease. Regulation by TFs not only changes the relatedness between TFs and their target genes but also affects other genes through gene networks, so measuring changes in relatedness between genes is also crucial for understanding the regulatory effect of TFs. With the development of computing, deep learning (DL) has been widely applied in bioinformatics and has advanced biological data analysis. This dissertation therefore designs and develops DL-based algorithms to identify TFBSs and to measure the conditional relatedness between genes. The main contents are as follows:

1. Evaluating 21 DL models for TFBS identification and motif finding, and developing a deepmotif server. This dissertation first collects 21 DL models and evaluates their TFBS identification, motif finding, scalability, and usability on 690 ENCODE ChIP-seq (Chromatin Immunoprecipitation Followed by Sequencing) datasets, 126 cancer ChIP-seq datasets, and 55 CLIP-seq (Crosslinking Immunoprecipitation Sequencing) datasets. Based on these results, a free deepmotif web server is developed to identify TFBSs and find motifs. The study gives researchers a reference for selecting an appropriate model for a given dataset, together with a set of DL strategies for TFBS identification, motif finding, scalability, and usability. It also finds that existing models are highly complementary to each other, and that data size, data type, and model outputs are the basis on which researchers should select a suitable DL model.

2. Proposing a cascaded convolutional neural network (CacPred) model for TFBS identification. The analysis and evaluation of the 21 DL models showed that convolution plays a crucial role in TFBS identification and that the performance of existing DL models still needs improvement. Based on these findings, this dissertation develops the CacPred model. CacPred is a convolution-based DL model with six layers: a convolutional layer, a deconvolutional layer, a combination layer, two concatenated convolutional layers, and a fully connected layer. CacPred takes both forward DNA sequences and their reverse complements as inputs, which helps it learn more sequence information and identify TFBSs accurately. Experimental results show that CacPred achieves the most accurate TFBS identification and obtains the highest scores across nine evaluation metrics. To interpret the model, motifs are used to represent the features that CacPred learns from the given sequences, and the results demonstrate that CacPred can find matched motifs. CacPred provides a technical basis and an auxiliary tool for accurate TFBS identification from large-scale data.

3. Developing a fully connected convolutional neural network model (FCNN) to measure the relatedness between genes under different conditions. Because regulation by TFs changes the relatedness between TFs and their target genes and also affects other genes through gene networks, measuring the relatedness between genes is crucial for studying how TFs regulate genes. Specifically, this study builds the FCNN by adding a fully connected layer to a traditional CNN so that it can take low-dimensional data as input. Gene samples are collected from the COXPRESdb, KEGG, PPI, and TRRUST databases, and expression similarity features and prior knowledge similarity features are calculated for them. The model then uses 12 gene features of expression similarity and prior knowledge similarity to calculate the conditional relatedness between genes. According to the author, FCNN is the first model to apply DL to calculating the conditional relatedness between genes, and it achieves higher accuracy than traditional methods for calculating gene relatedness. Finally, FCNN is used to build gene networks for cancers such as bladder urothelial carcinoma, breast invasive carcinoma, colon adenocarcinoma, and lung adenocarcinoma. Experimental results demonstrate that the cancer gene networks constructed with FCNN are enriched in the most biological pathways.

The main contribution of this dissertation, grounded in DL theory, is the study of TFBS identification and the conditional relatedness of genes from three perspectives: evaluating and analyzing 21 DL models that identify TFBSs and developing a deepmotif web server; developing the CNN-based CacPred model to identify TFBSs and improve on the performance of existing models; and developing the CNN-based FCNN model to measure the conditional relatedness between genes. The work addresses a frontier topic and has theoretical significance and scientific value. The three parts are connected and support one another; together they advance the research and analysis of TF function and lay a solid foundation for further analysis in the future.
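
The abstract notes that the cascaded convolutional model in part 2 takes forward DNA sequences and their reverse complements as inputs. The following is a minimal sketch of how such an input pair might be prepared, assuming a one-hot encoding over A, C, G, T. The abstract does not specify the encoding, and the names `reverse_complement`, `one_hot`, and `make_inputs` are hypothetical, chosen here for illustration only.

```python
# Illustrative only: the one-hot encoding and all function names are
# assumptions, not taken from the dissertation.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0, 0, 0, 0],  # unknown base encoded as all zeros
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element rows (A, C, G, T)."""
    return [ONE_HOT[base] for base in seq.upper()]


def make_inputs(seq):
    """Build the encoded (forward, reverse-complement) input pair."""
    return one_hot(seq), one_hot(reverse_complement(seq))


forward, reverse = make_inputs("ACGTTG")
```

Both encodings have one row per base, so a model could consume them as two parallel inputs or as a stacked pair; which of these the dissertation uses is not stated in the abstract.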
Keywords/Search Tags: Deep learning, genes, transcription factor binding sites, motif, conditional relatedness