| Post-translational modifications and protein-protein interactions underlie the normal biological function of proteins and play a very important role in living organisms. Although more than 350 kinds of protein post-translational modifications have been discovered, only a few have been well characterized, owing to limited experimental methods and a lack of sufficient data for analysis. Conventional experimental identification of protein post-translational modification sites is laborious and expensive, and optimizing the enzymatic reaction is very time-consuming; these factors severely limit the pace of related research. Therefore, a number of computational methods have been proposed and applied with varying success. These methods can not only predict protein post-translational modification sites efficiently and accurately, but also provide clues for further in vivo or in vitro confirmation. Research on protein-protein interactions helps researchers understand various biological processes in depth at the system level; it can also provide a reliable data source for further exploring the mechanisms of zoonotic diseases and point out directions for new drug research and development. In this paper, we study protein post-translational modification sites and protein-protein interactions. The main results can be summarized as follows:
(1) We propose an ensemble computational method to predict lysine ubiquitylation sites. First, four kinds of features are used to describe each amino acid at the lysine site and its surrounding sites. Second, to reduce computational complexity and improve the overall accuracy of the predictor, a feature selection method is used to select optimal feature subsets. Finally, an ensemble classifier is built with the optimal feature subsets as input and compared with other predictors.
Experimental results show that our method is very promising for predicting lysine ubiquitylation sites.
(2) Based on effective pupylation substrate information, we construct a novel predictor of pupylation sites. First, we extract five kinds of features for each protein sequence in the training dataset and use them to encode each amino acid at the pupylation site and its surrounding sites. Then, the maximum relevance minimum redundancy (mRMR) and incremental feature selection (IFS) methods are applied to the feature set to select the optimal feature subset. Finally, the prediction model is built on the optimal feature subset with the nearest neighbor algorithm (NNA), and it achieves an accuracy of 70.93% under jackknife cross-validation. Biological analysis of the optimal feature subset shows that evolutionary information and physicochemical/biochemical properties play an important role in the recognition of pupylation sites, and that sites 7, 10 and 11 contribute the most to their determination. The experimental results indicate that the combination of mRMR and IFS can effectively select the optimal feature subset for biological datasets. With the model built on the optimal feature subset, we obtain satisfactory prediction performance and can identify the biological significance of the selected features.
(3) The composition of k-spaced amino acid pairs (CKSAAP) is used for the first time to predict protein phosphorylation sites, and it improves the prediction accuracy. When benchmarked against PPRED, DISPHOS and NetPhos, CKSAAP_PhSite achieves a sensitivity of 84.815%, a specificity of 86.07% and an accuracy of 85.43% for serine; a sensitivity of 78.59%, a specificity of 82.26% and an accuracy of 80.31% for threonine; and a sensitivity of 74.44%, a specificity of 78.03% and an accuracy of 76.21% for tyrosine. Experimental results indicate that the proposed approach is effective and practical.
Based on the phosphorylation site prediction model, a corresponding online web server has been established.
(4) We propose a new augmented form of Chou’s pseudo amino acid composition to predict protein-protein interactions. First, three groups of descriptors are used to encode each interacting pair, so that each pair is represented by 930 features. Then principal component analysis (PCA) is used for dimensionality reduction; the resulting feature set is small yet retains as much of the information in the full set as possible. Finally, a protein-protein interaction prediction model is built on the resulting feature set and compared with other predictors on the Drosophila melanogaster and Helicobacter pylori datasets. Experimental results show that our method is very promising for predicting protein-protein interactions. |
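
For readers unfamiliar with the CKSAAP encoding named in result (3), the following is a minimal sketch of how such a feature vector is commonly computed for a sequence window around a candidate site. It is an illustration only, not the implementation behind CKSAAP_PhSite: the function name `cksaap`, the default `k_max = 2` and the handling of non-standard residues are assumptions, since the abstract does not state the window size or the gap values used.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(window, k_max=2):
    """Return a CKSAAP feature vector for a sequence window.

    For each gap k = 0..k_max, count every residue pair (a, b) in which
    b follows a with exactly k residues in between, then divide by the
    number of pair positions available in the window for that gap.
    k_max = 2 is an illustrative default, not a value from the abstract.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        n_positions = len(window) - k - 1
        for i in range(max(n_positions, 0)):
            pair = window[i] + window[i + k + 1]
            if pair in counts:  # assumption: skip padding or non-standard residues
                counts[pair] += 1
        denom = n_positions if n_positions > 0 else 1
        features.extend(counts[p] / denom for p in pairs)
    return features
```

Each gap value contributes 400 features (one per ordered residue pair), so `k_max = 2` gives a vector of length 1200. A vector of this kind would then be passed to a classifier of the reader's choice.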