
Prediction Of The Amino Acid Sequences Critical For Regulating Protein Phase Separation

Posted on: 2022-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: S H Li
GTID: 2480306326497984
Subject: Bio-engineering
Abstract/Summary:
Phase separation refers to the spontaneous formation, by certain proteins or nucleic acid molecules in an originally uniform environment, of a second phase with different physical and chemical properties. It is the basis for the formation of membraneless organelles. Phase separation contributes to a variety of biological processes, such as transcription and autophagy, and abnormal phase separation may cause neurological diseases and tumors. A central task in phase separation research is to identify the sequences that regulate protein phase separation in order to explore its mechanism. However, effective tools for identifying the critical amino acid sequences that regulate protein phase separation are still lacking. Compared with traditional experimental methods, computational bioinformatics methods are low in cost and fast to run, so developing such tools is important. In this work, we manually collected experimentally verified amino acid sequences crucial for regulating protein phase separation from the literature in the PubMed database, built a model with the random forest algorithm, optimized its hyperparameters with the TPOT package, and finally developed the prediction tool dSCOPE together with a corresponding web service. In addition, we used the prediction software to carry out a series of bioinformatics analyses of the human proteome. The following results were obtained.

(1) Data collection and feature description. All experimentally verified amino acid sequences crucial for regulating phase separation were collected from the published literature, redundancy was removed, and the physicochemical characteristics of the sequences were analyzed. The crucial sequences that regulate protein phase separation were more likely to be composed of polar, uncharged amino acids, to be of low complexity, disordered and prion-like, and often to be exposed. For the benchmark dataset, we generated peptides of length 15 using a sliding window with a step size of 8. Peptides from human proteins were used as the training dataset, and peptides from yeast proteins were used as the testing dataset. In total, we obtained 1,737 positive peptides and 3,125 negative peptides for the training dataset, and 379 positive peptides and 1,075 negative peptides for the testing dataset.

(2) Construction and evaluation of the prediction model. We used the human amino acid sequence data as the training dataset and the yeast data as the testing dataset. Four feature extraction methods (amino acid composition, composition of k-spaced amino acid pairs, position-specific scoring matrix and binary encoding profiles) were adopted, and eight physical and chemical properties (prion-like regions, surface accessibility, polarity, charge, hydropathy, exposure, low complexity regions and disorder) were integrated. We then constructed the random forest model and optimized its parameters. Model performance was evaluated with n-fold cross-validation and with validation on the independent testing dataset. The cross-validation results showed that the random forest prediction model was robust: the AUC values of 4-fold, 6-fold, 8-fold and 10-fold cross-validation were 0.8204, 0.8129, 0.8238 and 0.8213, respectively. On the independent testing dataset, the AUC value of the prediction model was 0.8463, higher than that of existing prediction tools.

(3) Development of the dSCOPE web server. Using Python, JavaScript, PHP and HTML, and integrating information such as protein secondary structure, subcellular localization and the physicochemical properties of amino acids, we built dSCOPE, a comprehensive web service for predicting the crucial amino acid sequences that can regulate protein phase separation.

(4) Prediction and analysis in the proteome. Based on the dSCOPE software, we carried out further bioinformatics analysis of the human proteome, including protein post-translational modification (PTM) analysis, functional annotation, analysis of interactions between phase-separated proteins and kinases and transcription factors, and pan-cancer analysis of mutations. Functional annotation and PTM analysis showed that phase separation is involved in transcription, proliferation, apoptosis and other physiological pathways, and that lysine modification and phosphorylation can affect the occurrence of protein phase separation. In the pan-cancer mutation analysis, we found that the prediction results of dSCOPE were consistent with the experimental data, and that the tumorigenic missense mutations were all enriched in regions containing functional domains.

In conclusion, dSCOPE is a stable predictor with excellent performance in detecting amino acid sequences critical for phase separation. Its web server also provides a variety of useful information visualizations, which can facilitate phase separation research.
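The peptide generation and two of the sequence encodings described in (1) and (2) can be illustrated with a minimal Python sketch. This is not the dSCOPE implementation. The function names, the toy sequence, the choice to drop trailing residues that do not fill a whole window, and the normalization of pair counts are all assumptions made for illustration. The abstract specifies only the window length (15), the step size (8) and the names of the encodings.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors line up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sliding_peptides(sequence, window=15, step=8):
    """Cut a protein sequence into fixed-length peptides.

    Assumption: trailing residues that do not fill a whole window are dropped.
    """
    last_start = len(sequence) - window
    return [sequence[i:i + window] for i in range(0, last_start + 1, step)]


def amino_acid_composition(peptide):
    """Fraction of each of the 20 amino acids in the peptide (20 values)."""
    counts = Counter(peptide)
    return [counts.get(aa, 0) / len(peptide) for aa in AMINO_ACIDS]


def cksaap(peptide, k=0):
    """Composition of k-spaced amino acid pairs (400 values).

    Counts ordered residue pairs separated by exactly k residues and
    normalizes by the number of such pairs in the peptide (an assumption).
    """
    n_pairs = len(peptide) - k - 1
    pairs = Counter(peptide[i] + peptide[i + k + 1] for i in range(n_pairs))
    return [pairs.get(a + b, 0) / n_pairs for a in AMINO_ACIDS for b in AMINO_ACIDS]


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the thesis dataset.
    toy_sequence = "MSGYGQSSQGYGQSSYGNQSGGYSQQPAPSS"
    for peptide in sliding_peptides(toy_sequence):
        features = amino_acid_composition(peptide) + cksaap(peptide, k=0)
        print(peptide, len(features))
```

Feature vectors built this way could be passed to a random forest classifier and scored by AUC under n-fold cross-validation, as the abstract describes. The position-specific scoring matrix and the eight physicochemical properties need external tools or annotations and are left out of this sketch.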
Keywords/Search Tags: Proteins, Phase separation, Random forest algorithm, Amino acid sequence, Prediction software