
Sequence-based Prediction Models For Pathogenicity Of Synonymous And Nonsense Single Nucleotide Variants

Posted on: 2024-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
GTID: 2530306941963779
Subject: Computer technology
Abstract/Summary:
Gene mutations play an important role in biological evolution and are closely associated with many diseases. Synonymous mutations used to be regarded as harmless because they do not change the protein sequence, whereas nonsense mutations are usually considered pathogenic because they prevent the translation of downstream sequence into protein. Both types of variant have therefore often been neglected in research on pathogenicity prediction. However, recent studies show that synonymous mutations can also cause human disease and that nonsense mutations occur even in healthy individuals. This thesis focuses on predicting the pathogenicity of both synonymous and nonsense single nucleotide variants in human genes. An ensemble model for synonymous mutations based on Stacking and a deep learning model for nonsense mutations based on the self-attention mechanism are constructed. A universal pathogenicity prediction framework for single nucleotide variants is also developed on top of these two models. To improve the generalization of the models, the studies were conducted entirely on the basis of gene sequences. The contents are as follows:

(1) An ensemble classification model called PON-SM, based on Stacking, is proposed to address the performance problems caused by low data quality and insufficient use of gene sequence information. First, synonymous mutations from public databases were filtered and updated, and seven categories of biochemical features were calculated from gene sequences: conservation, sequence, codon usage, splicing, energy, translation and functional scores. Then, base models were trained with the Random Forest (RF), Gradient Boosting Decision Tree (GBDT), Extreme Gradient Boosting (XGBoost) and LightGBM algorithms, and the meta-model was trained with Logistic Regression. PON-SM shows significantly improved prediction performance, achieving an accuracy of 0.862 and an MCC of 0.703 on the test set.

(2) A deep learning model called PON-NS, based on the self-attention mechanism, is proposed for predicting the pathogenicity of nonsense mutations, because existing universal models do not compute features targeted specifically at nonsense mutations. First, the dataset was constructed by collecting a sufficient number of high-confidence human nonsense mutations from public databases, and the data were undersampled based on biological information. Then, the self-attention mechanism was applied to learn contextual association information within the sequence, and abstract features at the mutation site alone were combined with sequence-derived features. The model achieves MCC = 0.842 and accuracy = 0.920 on the test set. In particular, PON-NS reduced the false positive rate by 39.7% on the ExAC dataset compared with DDIG-in, the only existing method available at the gene level.

(3) A new pathogenicity prediction framework for single nucleotide variants, called PON-SNV, was constructed on the basis of the two models above for synonymous and nonsense mutations, respectively. The framework also contains a prediction model for missense mutations, built with the self-attention mechanism and convolutional neural networks. PON-SNV achieved better performance on the test set, exceeding other methods by 9.3% in prediction accuracy and by 17.7% in MCC.

Because the pathogenicity prediction models for synonymous and nonsense mutations were constructed entirely from sequences, they have better generalization and higher accuracy. They can support studies of the association between genetic variants and human genetic diseases and provide important references for the diagnosis and treatment of these diseases.
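
The accuracy and MCC (Matthews correlation coefficient) values quoted in the abstract are standard binary-classification metrics derived from the confusion matrix. The following is a minimal sketch of how they can be computed, not the thesis's evaluation code. The function names and the toy labels (1 = pathogenic, 0 = benign) are assumptions made for illustration only and do not come from the thesis data.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy labels, invented for this example (not thesis data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
acc_value = accuracy(*counts)
mcc_value = mcc(*counts)
print(counts, acc_value, mcc_value)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when pathogenic and benign variants are imbalanced. This is one common reason to report both metrics together.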
Keywords/Search Tags:Synonymous mutation, Nonsense mutation, Pathogenicity prediction of single nucleotide variants, Ensemble learning, Self-attention mechanism