
PsePSSM-based Prediction For The Protein-ATP Binding Sites And Classification Of Membrane Proteins

Posted on: 2019-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhang
GTID: 2370330596488429
Subject: Bioinformatics
Abstract/Summary:
Protein sequence feature extraction is a key part of bioinformatics research. Composition-based k-mer encoding of amino acids is a commonly used method for extracting protein features, but for short sequences the resulting features are sparse. The pseudo position-specific scoring matrix (PsePSSM) is derived from the position-specific scoring matrix (PSSM); it reflects sequence evolution information and is suitable for sequences of unequal length. However, its fault tolerance has not yet received sufficient attention. This thesis applies PsePSSM to protein-ATP binding site prediction and to membrane protein classification. The results are as follows.

PsePSSM-based prediction of protein-ATP binding sites. Predicting protein-ATP binding sites is a highly imbalanced binary classification problem, and predicting these sites with high precision by machine learning is of major significance for the study of protein function and for drug design. The samples are sequence windows of equal length. Most existing studies empirically choose a window length of 17 aa, extract features from the PSSM, and then build SVC models for prediction. In these studies, the independent-test Acc is inflated while the MCC is low, so there is considerable room to improve prediction precision. In this thesis, the mutual information I is used to set the window length to 15 aa, the more fault-tolerant PsePSSM is used to extract features, multiple 1:1 SVC classifiers are trained, and their outputs are combined by simple voting. On both protein-ATP binding site data sets, ATP168 and ATP227, the prediction results are superior to the independent prediction results obtained with the reference feature extraction approaches: the MCC improves from the range 0.3110 ~ 0.5360 to 0.7512 on ATP168, and from the range 0.3060 ~ 0.553 to 0.7106 on ATP227. We further explain why the PsePSSM approach has strong fault tolerance.

PsePSSM-based prediction of membrane protein classes. Membrane protein classification is a typical multi-class problem on protein sequences of unequal length, and PsePSSM handles the unequal lengths effectively. Since the shortest sequence in the data set used in this thesis is 50 aa and the maximum separation distance is 25 aa, each sequence can be characterized by 520 PsePSSM features. With SVC modeling on these features, the independent prediction accuracy Acc is 66.86%. Feature selection can reduce model complexity and improve prediction accuracy. Using the feature selection method MIC-share, which terminates feature introduction automatically, an optimal subset of 16 retained features is obtained, and the independent prediction accuracy Acc rises to 86.41%, a significant improvement over the model without feature selection. The effects of three multi-class decomposition strategies, OVO (one-vs-one), OVA (one-vs-all) and HC (hierarchical classification), on prediction accuracy are further discussed. PsePSSM, which reflects sequence evolution information, is suitable for sequences of unequal length, and has strong fault tolerance, has broad application prospects in protein sequence feature extraction.
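The figure of 520 features follows from the usual PsePSSM construction: 20 column means plus 20 lagged correlation terms for each separation distance from 1 to 25, that is 20 + 20 x 25 = 520. The sketch below illustrates that construction under assumptions not stated in the abstract: the function name `pse_pssm` is hypothetical, and the row-wise standardization of the PSSM is one common choice that may differ from the normalization used in the thesis.

```python
import numpy as np

def pse_pssm(pssm, max_lag):
    """Map an L x 20 PSSM to a fixed-length vector of 20 + 20 * max_lag values.

    Assumption: each row is standardized over its 20 amino-acid scores
    before the features are computed; the thesis may normalize differently.
    """
    pssm = np.asarray(pssm, dtype=float)
    length = pssm.shape[0]
    if not 0 <= max_lag < length:
        raise ValueError("max_lag must be smaller than the sequence length")

    # Row-wise standardization; constant rows are left at zero.
    mean = pssm.mean(axis=1, keepdims=True)
    std = pssm.std(axis=1, keepdims=True)
    std[std == 0] = 1.0
    norm = (pssm - mean) / std

    # First 20 features: average score of each amino-acid column.
    features = [norm.mean(axis=0)]

    # For each lag, the mean squared difference between residues that
    # are `lag` positions apart, computed per amino-acid column.
    for lag in range(1, max_lag + 1):
        diff = norm[:-lag] - norm[lag:]
        features.append((diff ** 2).mean(axis=0))

    return np.concatenate(features)
```

Because the output length depends only on `max_lag`, a 50-residue and a 500-residue protein both yield 520 values when `max_lag` is 25, which is how PsePSSM accommodates sequences of unequal length.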
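The remark that Acc can be high while MCC stays low is a general property of imbalanced data. The short sketch below illustrates it with the standard definitions of both metrics; the confusion-matrix counts are invented purely for illustration and are not results from the thesis.

```python
import math

def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical counts: 1000 residues, of which only 50 are binding sites.
# The classifier recovers 15 binding residues and misses 35.
tp, fn, fp, tn = 15, 35, 5, 945

print(f"Acc = {accuracy(tp, fp, tn, fn):.4f}")
print(f"MCC = {mcc(tp, fp, tn, fn):.4f}")
```

With these counts the accuracy is dominated by the 945 correctly rejected non-binding residues, while the MCC is pulled down by the 35 missed binding sites, which is why MCC is the more informative metric for this kind of task.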
Keywords/Search Tags: PsePSSM, protein-ATP binding site prediction, membrane protein classification prediction, SVC, MIC-share