
Method For RNA Secondary Structure Prediction Based On Transformer

Posted on: 2022-10-05  Degree: Master  Type: Thesis
Country: China  Candidate: C Lu  Full Text: PDF
GTID: 2480306728983429  Subject: Computer technology
Abstract/Summary:
RNA is a complex, high-molecular-weight compound. It is involved in protein synthesis, regulates gene expression, and is closely tied to cellular processes such as cell differentiation and metabolism, as well as to biogenetic processes. Understanding RNA secondary structure, and through it RNA function, is one of the major problems in drug research, biochemistry, and bioinformatics. Experimental methods are tedious, time-consuming, and costly, which often discourages researchers. Dozens of heuristic algorithms and software libraries are available for RNA secondary structure prediction, but the computational complexity at the core of the problem has remained unresolved for decades. Pseudoknots complicate the problem further: although they account for only about 1.4% of base pairs, they often play a functionally important role. Pseudoknots are also present in about 40% of RNA secondary structures and contribute to folding into three-dimensional structures, so neglecting them would be detrimental to the development of RNA structural biology.

Classical folding algorithms have reached a bottleneck in performance and accuracy. Adding further constraints to these algorithms does not always improve accuracy, but it does impose a significant computational cost. In recent years, classical algorithms have been combined with machine learning methods that learn thermodynamic parameters, free energy parameters, folding parameters, scoring parameters, and so on. To a certain extent this has removed the need to design constraints and critical parameters by hand, and it has achieved results comparable to or better than those of classical algorithms. However, existing machine learning models, deep learning models, and models that combine classical folding algorithms with deep learning share three major drawbacks: the models are not pre-trained on existing data; the model encoder has weak feature extraction ability and handles long-distance dependencies poorly; and end-to-end models based entirely on deep learning can barely predict the secondary structure of RNA from unseen families.

To address these defects, this thesis proposes a Transformer-based RNA secondary structure prediction model that builds on and improves the Encoder-Decoder architecture. The encoder is pre-trained on the large bpRNA-1m database, and the model is then transferred to the public database of the Mathews lab for further training. The thesis also proposes an improved maximum base-pairing algorithm that post-processes the output of the deep learning model, combining the deep learning model with a classical folding algorithm so that each compensates for the other's weaknesses.

The proposed model is evaluated with five-fold cross-validation. Precision, recall, and F1-score are 84.7%, 86.2%, and 85.4% on the pseudoknot-free tRNA family, and 85.8%, 87.1%, and 86.4% on the pseudoknot-free 5S rRNA family. On these two families, the F1-score is 10% and 25% higher, respectively, than the next-best F1-score among the classical folding algorithms, and 4.1% and 4% higher, respectively, than the next-best F1-score among the deep learning models. On the tmRNA family, which contains pseudoknots, precision, recall, and F1-score are 62.4%, 79.2%, and 70%, respectively, an improvement of 31% over the next-best F1-score among the classical folding algorithms and of 9% over the next-best F1-score among the deep learning models.

The thesis also analyzes in detail the model's predictions for base pairs involved in pseudoknot structures. For pseudoknot structures in the tmRNA family, precision, recall, and F1-score are 17.5%, 52.1%, and 26.2%, respectively, a 36% improvement over the next-best F1-score among the classical folding algorithms and a 13% improvement over the next-best F1-score among the deep learning models. Further experiments and in-depth analysis of how pre-training, network structure, and the improved maximum base-pairing algorithm affect the results are also presented.
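The first two figures in each reported triple behave as precision and recall over predicted base pairs: in every case the stated F1-score equals their harmonic mean. The sketch below shows one common way such base-pair metrics are computed and checks the reported numbers against that relation. It assumes a structure is represented as a set of (i, j) index pairs; that representation and the function names are illustrative assumptions, not the evaluation code of the thesis.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pair_metrics(predicted, reference):
    """Precision, recall, and F1 over base pairs given as (i, j) tuples.

    Assumption for illustration: a pair counts as correct only when both
    indices match exactly; (i, j) and (j, i) are treated as the same pair.
    """
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy structures (hypothetical data): two of three predicted pairs are correct.
reference = {(1, 10), (2, 9), (3, 8)}
predicted = {(1, 10), (9, 2), (4, 7)}
toy_precision, toy_recall, toy_f1 = pair_metrics(predicted, reference)

# Consistency check on the figures quoted in the abstract.
reported = {
    "tRNA": (0.847, 0.862, 0.854),
    "5S rRNA": (0.858, 0.871, 0.864),
    "tmRNA": (0.624, 0.792, 0.70),
    "tmRNA pseudoknots": (0.175, 0.521, 0.262),
}
recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
```

For each family, the value in `recomputed` agrees with the quoted F1-score to within rounding, which supports reading the first two figures as precision and recall.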
Keywords/Search Tags:RNA secondary structure prediction, Pseudoknots, Deep learning, Transformer