Prediction Of Deleterious Synonymous Mutations Based On Random Forest

Posted on: 2019-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: F Shi
Full Text: PDF
GTID: 2348330542997724
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Because the genetic code is degenerate, a synonymous mutation does not change the encoded amino acid sequence. For this reason, synonymous mutations were long considered silent, with no effect on the organism. A growing body of experimental evidence now shows that synonymous mutations affect the development and progression of disease by disrupting gene expression and cell function through mechanisms such as codon usage bias, translation efficiency and other factors. Distinguishing pathogenic synonymous single nucleotide variants (sSNVs) from neutral ones is nevertheless challenging, because pathogenic sSNVs tend to have a low prevalence. Although many methods have been developed to predict the functional impact of single nucleotide variants, only a few are designed specifically to identify pathogenic sSNVs. Building on current synonymous mutation prediction methods, we first construct a well-performing prediction model, IDSV (Identifying Deleterious Synonymous Variants), by working at the data and feature level, and then optimize it at the algorithm level to construct IDSV-? (Identifying Deleterious Synonymous Variants-model ?), which further improves predictive performance.

At the data and feature level, we use a reliable, balanced training set, quantify a rich set of features with high discriminative ability, select the optimal feature subset by sequential backward selection, and then train a random forest classifier to obtain IDSV, a new machine learning model for predicting deleterious synonymous variants. Experimental results show that IDSV compares favorably with existing tools for predicting deleterious mutations, such as SilVA, DDIG-SN, TraP, CADD and FATHMM-MKL. The results also indicate that splicing, conservation and translation efficiency are informative features for identifying deleterious sSNVs. Functional region annotation and sequence features are only weakly informative on their own, but they help discriminate deleterious sSNVs from benign ones when combined with other predictive features. Conservation, splicing, sequence features, translation efficiency and functional region annotation are therefore all helpful in predicting harmful sSNVs.

At the algorithm level, we start from the simpler model IDSV and improve the classifier to construct IDSV-?, further optimizing predictive performance. First, because the amount of experimental data is small, we split the training set into five sub-training sets and five corresponding sub-validation sets, following the data partitioning used in five-fold cross-validation, and construct five sub-random-forest models. Second, random forest is a bagging ensemble learning algorithm based on decision trees, so it may include sub-trees that correlate only weakly with the target classification and sub-trees that are highly redundant with one another. We therefore use the results on each sub-validation set to screen the sub-trees of each sub-random-forest model for correlation and redundancy, and finally obtain the optimized random forest model IDSV-?. On the two composite measures F-measure and AUC, IDSV-? improves on our previous model and on several existing predictive tools.

In recent years, biomedical researchers have paid increasing attention to synonymous mutations, so the number of known synonymous mutations keeps growing and research on their pathogenic mechanisms continues. The synonymous mutation prediction models built by our research group (IDSV and IDSV-?), which have good classification performance, should facilitate this research. With the rapid development of personalized precision medicine, these methods can also serve as an effective aid in the diagnosis and prevention of disease.
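
The sub-tree screening step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which correlation or redundancy measures the thesis uses, so everything below is an assumption made for illustration: relevance is taken to be a tree's agreement with the validation labels, redundancy is taken to be pairwise agreement between trees, and the function names `agreement`, `screen_trees` and `majority_vote` and both thresholds are hypothetical. The sketch operates on per-tree predictions for a single sub-validation set; in the five-model setup the same screening would presumably be repeated for each sub-random-forest.

```python
from statistics import mean


def agreement(a, b):
    """Fraction of positions where two 0/1 prediction vectors match."""
    return mean(1.0 if x == y else 0.0 for x, y in zip(a, b))


def screen_trees(tree_preds, labels, min_relevance=0.6, max_redundancy=0.9):
    """Return the indices of trees kept after relevance and redundancy screening.

    tree_preds: one list of 0/1 validation predictions per tree.
    labels: true 0/1 labels for the same validation samples.
    Both thresholds are hypothetical values chosen for illustration.
    """
    # Relevance (assumed measure): how well each tree alone matches the labels.
    relevance = [agreement(p, labels) for p in tree_preds]
    # Visit the most relevant trees first so they are preferred over near-copies.
    order = sorted(range(len(tree_preds)), key=lambda i: relevance[i], reverse=True)
    kept = []
    for i in order:
        if relevance[i] < min_relevance:
            continue  # weakly related to the target classification
        if any(agreement(tree_preds[i], tree_preds[j]) > max_redundancy for j in kept):
            continue  # nearly duplicates a tree that is already kept
        kept.append(i)
    return sorted(kept)


def majority_vote(tree_preds, kept):
    """Combine the kept trees by unweighted majority vote for each sample."""
    n_samples = len(tree_preds[0])
    return [
        1 if 2 * sum(tree_preds[i][s] for i in kept) > len(kept) else 0
        for s in range(n_samples)
    ]
```

Visiting trees in order of decreasing relevance is a design choice of this sketch: when two trees are near-duplicates, the one that agrees better with the validation labels is the one retained.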
Keywords/Search Tags: Synonymous mutation, Pathogenicity prediction, Feature selection, Random forest, Model optimization