Construction of DenseNet-based TFBS Prediction Models for Plants and the Transfer Learning Application of Trans-species Prediction

Posted on: 2024-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: H L Cheng
GTID: 2530307160976429
Subject: Bioinformatics
Abstract/Summary:
Most of the genetic variants discovered through genome-wide association studies (GWAS) are located in non-coding regions of the genome, which points to the biological importance of non-coding DNA sequences. Non-coding DNA sequences exert their biological function primarily through regulatory elements. Cis-regulatory elements (CREs) and trans-acting factors play a crucial role in regulating gene expression, and their interactions can determine when, where, and how much a gene is expressed. Transcription factors (TFs) are the most common trans-acting factors, and they regulate gene expression by binding specifically to transcription factor binding sites (TFBSs) in CREs. Recently, with the spread of genome editing technologies such as CRISPR-Cas9, plant biologists have widely adopted TFBS editing to study how TFBSs regulate gene expression and indirectly affect quantitative traits in plants. The systematic identification of TFBSs, and especially of their core motifs, is therefore not only important for understanding transcriptional regulation during development, but is also in high demand in plant research and has become a core part of plant breeding.

However, related research in plants faces two common problems. First, computational methods such as position weight matrix (PWM) scanning have a high false positive rate and cannot accurately locate the core motif. Second, TFBS data for plants are scarce compared with those for humans and other mammals, and are insufficient to support research on transcriptional regulation. To address these problems, this study carried out two parts of work.

In the first part, we trained and tested DenseNet deep learning models on ChIP-seq datasets for 104 maize TFs, DAP-seq datasets for 265 Arabidopsis thaliana TFs, and ChIP-seq datasets for 20 rice TFs, building a total of 389 DenseNet models for predicting TFBSs. All 389 models performed well: every model achieved an AUC above 0.95 on its test set, and the median AUC was 0.9997. We then compared DenseNet with classical models such as LS-GKM, Deep CNN, and MEME-M1-MAX, and found that the DenseNet models substantially outperformed these earlier approaches in predicting plant TFBSs.

Next, to overcome the inability of previous computational methods to give accurate positions of core motifs, we explored an important application that is also a highlight of this study: using a combination of DeepLIFT, in-silico tiling deletion, and in-silico mutagenesis to identify potential core motifs that significantly affect TF-DNA binding. We took a positive sample sequence (Chr9: 1715886-1716386(+)) of bHLH145, a randomly chosen maize TF, and described in detail how to use these three interpretability methods to mine the core motif and determine its precise location. Using TF-MoDISco and Global Importance Analysis (GIA), we further examined the biological significance of the core motifs identified across all positive samples of bHLH145. As another example, we recovered the core motif "GCGCGTGT" in the promoter region of the rice IPA1 gene using this new strategy of combining three deep learning interpretability methods, which provides strong support for plant gene editing and breeding. Finally, we developed a user-friendly web server (http://www.hzau-hulab.com/TSPTFBS/) that integrates the 389 DenseNet models and the three interpretability methods, providing model support and a reference for selecting gene editing targets in plant promoters.

In the second part, we used computational methods to address the shortage of TFBS data in plants. Specifically, based on the principle of transfer learning and the similarity of TF protein sequences, we carried out trans-species prediction of plant TFBSs. First, we downloaded ChIP-seq datasets for 15 TFs suitable for transfer learning, covering 6 species in total, from ChIP-Hub. We then transferred the 389 pre-built DenseNet models to these 6 species and predicted the TFBSs of the 15 TFs. For 73.33% of the TFs (11 out of 15), the positive predictive value (PPV), negative predictive value (NPV), and recall were between 0.8 and 1.0. We also compared these results with the trans-species predictions of the 265 Deep CNN models for Arabidopsis thaliana TFBSs previously built by our research group. First, 100% (9 out of 9) of the tomato TFs achieved greater PPV and NPV in this study. Second, for the other 6 TFs from the remaining 5 species, 66.67% (4 out of 6) achieved greater PPV and NPV in this study. Overall, 86.67% (13 out of 15) of the TFs yielded greater PPV and NPV. We therefore conclude that trans-species TFBS prediction based on transfer learning is feasible and is important for addressing the shortage of TFBS data in plants.

In summary, we used deep learning algorithms and currently available experimental data on plant TFBSs to construct TFBS prediction models, and combined three interpretability methods to pinpoint the core motif within a given genomic region, which is a first step toward solving the problem of inaccurate core motif recognition in TFBSs. The models were also used for trans-species prediction to address the shortage of TFBS data in plants. Our research supports the following conclusions: (1) predicting plant TFBSs with deep learning is feasible; (2) identifying core motifs with interpretability methods is feasible; (3) trans-species prediction based on transfer learning is feasible; (4) a web server for identifying and localizing core motifs in plant promoter regions can provide researchers with genome editing targets.
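
To make the in-silico mutagenesis step mentioned above more concrete, here is a minimal Python sketch of the general idea: substitute every base at every position, rescore the sequence, and treat positions where a substitution lowers the score as candidate core-motif positions. It is not the thesis implementation. The function toy_binding_score is a hypothetical stand-in for a trained DenseNet model, and the flanking sequence is invented for illustration; only the motif "GCGCGTGT" comes from the abstract.

```python
from typing import Callable, Dict, List

BASES = "ACGT"
CORE_MOTIF = "GCGCGTGT"  # motif quoted in the abstract


def toy_binding_score(seq: str, motif: str = CORE_MOTIF) -> float:
    """Hypothetical stand-in for a trained model (an assumption, not DenseNet).

    Returns the best fraction of motif positions matched by any window.
    """
    best = 0
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        matches = sum(1 for a, b in zip(window, motif) if a == b)
        best = max(best, matches)
    return best / len(motif)


def in_silico_mutagenesis(
    seq: str, score_fn: Callable[[str], float]
) -> List[Dict[str, float]]:
    """For each position, map each alternative base to (mutant - reference) score."""
    reference = score_fn(seq)
    effects: List[Dict[str, float]] = []
    for i, ref_base in enumerate(seq):
        row: Dict[str, float] = {}
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:i] + alt + seq[i + 1:]
            row[alt] = score_fn(mutant) - reference
        effects.append(row)
    return effects


def position_importance(effects: List[Dict[str, float]]) -> List[float]:
    """Largest score drop caused by any substitution at each position."""
    return [max(0.0, -min(row.values())) for row in effects]


# Invented example sequence: the motif embedded in AT-rich flanks.
sequence = "ATATAT" + CORE_MOTIF + "ATATAT"
effects = in_silico_mutagenesis(sequence, toy_binding_score)
importance = position_importance(effects)

# Positions where at least one substitution lowers the score.
core = [i for i, value in enumerate(importance) if value > 0]
print(core[0], core[-1], sequence[core[0]:core[-1] + 1])
```

With a real model, score_fn would wrap the model's prediction for a one-hot encoded sequence, and the per-position effects would typically be inspected alongside DeepLIFT scores and tiling-deletion results rather than thresholded at zero.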
Keywords/Search Tags:Transcription factor binding site, Deep learning, Convolutional neural network, Biological interpretability, Core motif, Transfer learning, Trans-species prediction