
Algorithmic Research on Deep Representation and Classification of Nucleotide Sequence Based on Adversarial Learning

Posted on: 2022-11-07
Degree: Master
Type: Thesis
Country: China
Candidate: G C Zhu
GTID: 2480306761959679
Subject: Automation Technology
Abstract/Summary:
In eukaryotic cells, biological processes are orchestrated by complex mechanisms at several levels. DNA sequences are both the carrier of biological information and the medium through which it is passed on, and they contain the signals required by biochemical processes. A gene must pass through two stages of biochemical processing, transcription and translation, to carry out its function. The detection of genomic signals and regions (GSRs) is therefore critical for understanding genome organization, gene regulation, and gene function. This study addresses two types of GSR: polyadenylation signals (PAS), which are associated with transcription, and translation initiation sites (TIS), which are associated with translation, so the two correspond to the two main processing stages. Many computational methods have been proposed for detecting these two kinds of GSR. However, they have drawbacks: some target only one type of GSR, or only one eukaryotic species, and some models are difficult to generalize and may not be robust. Their performance also leaves considerable room for improvement. This research therefore focused on addressing these problems.

In this study, a novel deep learning framework based on DNABERT, adversarial training, BiGRU, and multi-scale convolutional neural networks was proposed. The aim was to build an end-to-end, general, and robust recognition model that does not require features to be developed specifically for biological sequence tasks. The model is capable of extracting deep information and recognizing patterns. It was trained and evaluated on 12 genome-wide, cross-organism datasets, and it outperformed the state-of-the-art (SOTA) methods on three metrics. Hyper-parameter optimization experiments and ablation experiments were conducted to improve model performance and to demonstrate the necessity of each component of the model. The model was also trained and evaluated on 9 further GSR-related datasets released in other works, to assess its generalization capability and to achieve multi-domain adaptation. On these datasets, too, the model performed best among the methods compared.

It is concluded that the model is suitable for the GSR identification task. The model can be easily generalized to other domains, and it provides a general framework for prediction and regression tasks on biological sequences. In future work, a web server will be built and the trained models from this study will be uploaded to it, providing a convenient way to identify GSRs automatically and to extend the approach to other biological recognition tasks such as protein secondary structure, splice sites, polyadenylation cleavage sites, and stop codons.
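As a rough illustration of how a raw nucleotide sequence can be prepared for a DNABERT-style encoder, the sketch below splits a sequence into overlapping k-mers, the token unit such models generally take as input. The abstract does not state the k-mer length or any preprocessing details used in the thesis, so `k=6`, the function name `kmer_tokens`, and the example fragment are assumptions made for illustration only.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 6) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1).

    Hypothetical helper: the value of k and the upper-casing step are
    assumptions, not details taken from the thesis.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence shorter than k produces an empty token list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    # Made-up 12-nt fragment containing the canonical PAS hexamer AATAAA.
    print(" ".join(kmer_tokens("GCAATAAAGTCT", k=6)))
```

In a full pipeline, the resulting tokens would be mapped to vocabulary IDs and passed to the pretrained encoder; the downstream BiGRU and multi-scale convolution layers named in the abstract would then operate on the encoder's output.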
Keywords/Search Tags:biological sequence, Genomic Signals and Regions, deep representation, adversarial training, feature fusion