
Prediction Method Research Of Special Protein Recognition Based On Protein Sequence Information

Posted on: 2019-12-31  Degree: Master  Type: Thesis
Country: China  Candidate: Y B Wang
GTID: 2370330626952405  Subject: Computer technology
Abstract/Summary:
Protein recognition is an important research branch of bioinformatics. Its task is to build models that correctly classify proteins of unknown type. This thesis studies recognition in two cases: DNA-binding proteins and protein crystallization. DNA-binding proteins play an important role in a variety of biomolecular functions, and protein crystallization is a key step in determining protein structure by X-ray crystallography, so recognizing and predicting both is particularly important. Traditional biological experiments are more accurate, but they are time-consuming, laborious and expensive. With the explosive growth in the number of protein sequences, experimental methods can no longer keep up with demand, so accurate, fast and low-cost computational methods are urgently needed. Because structural information is unavailable for most proteins, recognition based on protein sequence information is more widely applicable.

This thesis describes two pieces of work in detail. First, for DNA-binding protein recognition, we designed three feature extraction algorithms: Normalized Moreau-Broto Autocorrelation (NMBAC), Position-Specific Scoring Matrix-Discrete Cosine Transform (PSSM-DCT) and Position-Specific Scoring Matrix-Discrete Wavelet Transform (PSSM-DWT). We used Support Vector Machine Recursive Feature Elimination combined with Correlation Bias Reduction (SVM-RFE+CBR) for feature selection and a support vector machine (SVM) as the classifier. Leave-one-out cross-validation was used for evaluation on the training datasets PDB1075 and PDB594, and an independent test was performed on the test set PDB186.

Second, for protein crystallization recognition, we used six feature extraction algorithms: Average Block-Position-Specific Scoring Matrix (AVBlock-PSSM), Average Block-Secondary Structure (AVBlock-SS), Global Encoding (GE), Pseudo-Position-Specific Scoring Matrix (PsePSSM), Protscale and Discrete Wavelet Transform-Position-Specific Scoring Matrix (DWT-PSSM). The extracted features were linearly combined to build an SVM prediction model. We used two pairs of datasets: TRAIN3587 and TRAIN1500 are the training sets, evaluated by five-fold cross-validation, and TEST3585 and TEST500 are their respective test sets, used for independent testing.

For DNA-binding protein recognition, our method achieved the highest accuracy on the datasets PDB1075 and PDB594, and its accuracy in independent testing reached 76.3%, which is superior to most existing methods. For protein crystallization recognition, our method achieved the best results on the first pair of datasets, TRAIN3587 and TEST3585, and also the best result on the training dataset TRAIN1500 under five-fold cross-validation. Its independent-test result on TEST500 is not the best, but it surpasses most existing methods. Both studies show that the proposed methods have clear advantages for identifying DNA-binding proteins and protein crystallization and can be applied effectively to the identification of related proteins.
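To make the sequence-based feature extraction mentioned in the abstract more concrete, the following is a minimal sketch of a normalized Moreau-Broto autocorrelation descriptor in its commonly used form: a physicochemical scale is standardized over the 20 amino acids, and for each lag d the products of values d positions apart are averaged over the sequence. The abstract does not state which properties or maximum lag the thesis used, so the Kyte-Doolittle hydropathy scale, the function names `standardize` and `nmbac`, the `max_lag` parameter, and the example sequence are all illustrative assumptions, not the author's implementation.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy index. Chosen only for illustration; the
# properties actually used in the thesis are not given in the abstract.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    mu = mean(scale.values())
    sigma = pstdev(scale.values())
    return {aa: (value - mu) / sigma for aa, value in scale.items()}


def nmbac(sequence, scale, max_lag):
    """Return one autocorrelation feature per lag d = 1..max_lag.

    Each feature is (1 / (N - d)) * sum_i P(i) * P(i + d), where P(i) is the
    standardized property value of the i-th residue and N is the number of
    residues kept. Residues missing from the scale are skipped (an assumption
    of this sketch).
    """
    norm = standardize(scale)
    values = [norm[aa] for aa in sequence.upper() if aa in norm]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    features = []
    for lag in range(1, max_lag + 1):
        total = sum(values[i] * values[i + lag] for i in range(n - lag))
        features.append(total / (n - lag))
    return features


if __name__ == "__main__":
    # Hypothetical sequence fragment, used only to show the call.
    print(nmbac("MKTAYIAKQRQISFVKSHFSRQ", KYTE_DOOLITTLE, max_lag=5))
```

Repeating the computation for several scales and concatenating the results gives a fixed-length vector for sequences of any length, which is what allows such descriptors to be fed to an SVM classifier.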
Keywords/Search Tags: Protein recognition, Protein sequence information, Feature extraction, Support vector machine