Protein Crystallization Propensity Prediction Based On Ensemble Learning

Posted on: 2015-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Wang
Full Text: PDF
GTID: 2250330428990979
Subject: Computer application technology
Abstract/Summary:
As the main experimental technique for protein structure determination, X-ray crystallography accounts for the majority of solved structures. However, it is characterized by relatively low success rates. It is therefore of practical importance to predict whether, and how likely, a target protein can be crystallized. To address this problem, we first download the latest data from the online database to construct the training data set, then choose a more comprehensive feature set based on an analysis of the relevant literature, and finally build a protein crystallization propensity classifier using ensemble learning.

Our method builds on an intensive study of recent protein crystallization propensity prediction methods, including CRYSTALP, XtalPred, ParCrys, MetaPPCP, PXS, MCSG-Zscore, SCMCRYS, and others. Based on an analysis of these methods, this work selects 20 features covering both sequence-related information and physicochemical properties. The sequence-related features include the length of the protein sequence, molecular weight, the percentage content of certain amino acids, protein secondary structure, information on specific peptide-bond combinations, and so on. The physicochemical features include the grand average of hydropathy (GRAVY), the isoelectric point (pI), the instability index, the energy index, signal peptide, membrane protein, and other information.

We downloaded 817,000 protein sequence experimental trials from the online PepcDB database, last updated on December 30, 2013. After a rigorous selection process, we obtained a final data set balanced between the crystallization and non-crystallization classes.
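A few of the sequence-derived features named in the abstract (sequence length, amino acid percentages, and the grand average of hydropathy) can be sketched as follows. This is a minimal illustration, not the thesis's feature extractor: the function name `sequence_features` is hypothetical, computing the fraction of all 20 standard residues is an assumption (the abstract does not say which amino acids were used), and GRAVY is computed with the standard Kyte-Doolittle scale.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq: str) -> dict:
    """Return a small, illustrative feature dictionary for one protein sequence.

    Assumption: only the 20 standard one-letter residue codes are accepted.
    """
    seq = seq.upper()
    if not seq or any(aa not in KYTE_DOOLITTLE for aa in seq):
        raise ValueError("sequence must be non-empty and use standard residues")

    n = len(seq)
    counts = Counter(seq)
    features = {
        "length": n,
        # GRAVY: mean hydropathy over all residues.
        "gravy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / n,
    }
    # Fraction of each standard amino acid in the sequence.
    for aa in sorted(KYTE_DOOLITTLE):
        features[f"frac_{aa}"] = counts.get(aa, 0) / n
    return features
```

Features such as molecular weight, pI, the instability index, and secondary structure would normally come from dedicated tools or libraries and are left out of this sketch.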
According to the "currentstate" field, all records are divided into four categories: protein production failed, purification failed, crystallization failed, and crystallized.

Finally, our method builds the prediction model for protein crystallization propensity using ensemble learning. The procedure is to train a series of classifiers according to some principle, combine the predictions of all classifiers through some strategy (such as voting), and thereby obtain better results than a single classifier. In this paper, the bagging algorithm is chosen as the design principle of the prediction model, and the support vector machine (SVM) is chosen as the base classifier of the ensemble.

In the experimental design, ensembles over the feature-set dimension and over the training-sample dimension are both investigated, as well as the combination of the two. Experimental results show that the sequence-based features and the physicochemical features are complementary to some degree. Furthermore, ensembling over the training samples and ensembling over the feature set each improve accuracy, while applying both simultaneously produces the best results. Finally, some important open questions and future directions for this study are discussed.
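The combined ensemble described above, resampling the training set and the feature set at the same time and then voting, can be sketched roughly as below. This is only an illustration under stated assumptions: to keep the example dependency-free, the base learner is a nearest-centroid classifier standing in for the SVM used in the thesis, and the function names, the number of estimators, and the feature-subset size are all hypothetical choices not taken from the abstract.

```python
import random
from collections import Counter


def fit_centroids(X, y):
    """Stand-in base learner (NOT an SVM): one mean vector per class."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        total = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            total[j] += value
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}


def predict_centroids(model, row):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(model, key=lambda c: dist2(model[c]))


def fit_bagging(X, y, n_estimators=15, n_features=None, seed=0):
    """Train base learners on bootstrap samples and random feature subsets."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = n_features or d  # default: every learner sees all features
    ensemble = []
    for _ in range(n_estimators):
        # Sample dimension: draw n training rows with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        # Feature dimension: draw k feature columns without replacement.
        cols = sorted(rng.sample(range(d), k))
        Xb = [[X[i][j] for j in cols] for i in idx]
        yb = [y[i] for i in idx]
        ensemble.append((cols, fit_centroids(Xb, yb)))
    return ensemble


def predict_bagging(ensemble, row):
    """Combine the base learners' predictions by majority vote."""
    votes = Counter(
        predict_centroids(model, [row[j] for j in cols])
        for cols, model in ensemble
    )
    return votes.most_common(1)[0][0]
```

Setting `n_features` to the full feature count gives an ensemble over training samples only, while a smaller value adds the feature-set dimension. With scikit-learn available, the same idea is usually expressed by wrapping an SVM in its bagging classifier rather than hand-rolling the loop.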
Keywords/Search Tags: X-ray Crystallography, Machine Learning, Ensemble Learning, Bagging Algorithm