Protein Crystallization Propensity Prediction Based On Ensemble Learning

Posted on:2015-02-04

Degree:Master

Type:Thesis

Country:China

Candidate:T Y Wang

Full Text:PDF

GTID:2250330428990979

Subject:Computer application technology

Abstract/Summary:

X-ray crystallography is the main experimental technique for protein structure determination and accounts for the majority of solved structures. However, its success rate is relatively low, so predicting whether, and how likely, a target protein is to crystallize has important practical significance. To address this problem, we first download the latest data from an online database to construct the training set, then choose a more comprehensive feature set based on an analysis of the relevant literature, and finally build a protein crystallization propensity classifier using ensemble learning.

Our method is based on an intensive study of recent protein crystallization propensity prediction methods, including CRYSTALP, XtalPred, ParCrys, MetaPPCP, PXS, MCSG-Zscore, SCMCRYS, and others. After analyzing these methods, we select 20 features that cover both sequence-related information and physicochemical properties. The sequence-related features include the length of the protein sequence, molecular weight, the percentage of certain amino acids in the protein, the protein secondary structure, and information on combinations of specific peptide bonds. The physicochemical features include the grand average of hydropathy, the isoelectric point (pI), instability, the energy index, and signal peptide and membrane protein information. We downloaded 817,000 protein sequence experimental trials from the online PepcDB database, last updated on December 30, 2013. After a rigorous selection process, we obtain a final dataset that is balanced between the crystallization and non-crystallization classes.
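A few of the features named above can be computed directly from the amino acid string. The following is a minimal sketch, not the thesis's feature pipeline: the function name `sequence_features` and the output keys are hypothetical, and the hydropathy values are the standard Kyte-Doolittle scale, which is assumed here as the basis for the grand average of hydropathy. It covers only sequence length, per-residue fractions, and the hydropathy average; the remaining features would need additional tools.

```python
from collections import Counter

# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq: str) -> dict:
    """Return length, grand average of hydropathy, and residue fractions.

    Hypothetical helper for illustration; it rejects empty sequences and
    any letter outside the 20 standard amino acids.
    """
    seq = seq.strip().upper()
    if not seq or any(aa not in KYTE_DOOLITTLE for aa in seq):
        raise ValueError("sequence must be non-empty and use the 20 standard residues")

    n = len(seq)
    counts = Counter(seq)
    features = {
        "length": n,
        # Grand average of hydropathy: mean hydropathy value over all residues.
        "gravy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / n,
    }
    # Fraction of each amino acid type in the sequence.
    for aa in sorted(KYTE_DOOLITTLE):
        features[f"frac_{aa}"] = counts[aa] / n
    return features
```

Calling `sequence_features("MKTAYIAK")` returns a dictionary with 22 entries that can be flattened into one row of a feature matrix.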
According to the "current state" field, all data are divided into four categories: protein production failed, purification failed, crystallization failed, and crystallized.

Finally, our method builds the prediction model for protein crystallization propensity using ensemble learning. The approach trains a series of classifiers according to a given principle, then makes decisions by integrating the predictions of all classifiers through a combination strategy (such as voting), and thereby obtains better results than a single classifier. In this paper, the bagging algorithm is selected as the design principle of the prediction model, and the support vector machine (SVM) is chosen as the base classifier of the ensemble.

In the experimental design, ensembles over the feature-set dimension and over the training-sample dimension are both investigated, as well as their combination. Experimental results show that the sequence-based features and the physicochemical features are complementary to some degree. Furthermore, ensembling over training samples and ensembling over feature sets each improve accuracy, while applying both simultaneously produces the best results. Finally, future directions for several important open questions of this study are discussed.
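The combination of sample-level and feature-level ensembling with a voting strategy can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the names `fit_bagging`, `predict_bagging`, and `NearestCentroid` are hypothetical, the task is reduced to two classes, and a simple nearest-centroid learner stands in for the SVM base classifier so that the example needs only the standard library. Any object with the same `fit` and `predict_one` methods could be passed as `make_base` instead.

```python
import random
from collections import Counter


class NearestCentroid:
    """Stand-in base learner (assumption: the thesis uses an SVM here)."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
        # One mean vector per class label.
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict_one(self, row):
        def dist2(centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))

        return min(self.centroids, key=lambda label: dist2(self.centroids[label]))


def fit_bagging(X, y, n_models=11, n_features=None, seed=0, make_base=NearestCentroid):
    """Train n_models base learners, each on a bootstrap sample of the rows
    (sample dimension) and a random subset of the columns (feature dimension)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = d if n_features is None else n_features
    ensemble = []
    for _ in range(n_models):
        rows = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        cols = sorted(rng.sample(range(d), k))        # feature subset, no replacement
        X_boot = [[X[i][j] for j in cols] for i in rows]
        y_boot = [y[i] for i in rows]
        ensemble.append((cols, make_base().fit(X_boot, y_boot)))
    return ensemble


def predict_bagging(ensemble, row):
    """Majority vote over the base learners' predictions."""
    votes = Counter(
        model.predict_one([row[j] for j in cols]) for cols, model in ensemble
    )
    return votes.most_common(1)[0][0]
```

Setting `n_features=None` gives plain bagging over training samples only, while a smaller `n_features` adds the feature-set dimension. An odd `n_models` avoids tied votes in the two-class case.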

Keywords/Search Tags:

X-ray Crystallography, Machine Learning, Ensemble Learning, Bagging Algorithm

Related items

1	A Study For Ensemble Learning Based On SVM
2	Ensemble Forecast Bias Correction Based On Machine Learning Methods
3	Research On Weather Elements Prediction Based On Ensemble Learning Model And CNN-ALSTM
4	Machine Learning Researches On Exploration Of The H2H Gene In 3D View
5	Research On Functional Protein Prediction Method Based On Ensemble Learning
6	Research On Housing Pricing Model Based On Machine Learning
7	Research And Implementation Of Chromatin Topologically Associating Domain Detection Algorithm Based On Ensemble Learning
8	Research On Long Non-coding RNA Identification Based On Machine Learning
9	Research On β-Lactamase Prediction And Annotation Analysis Based On Ensemble Learning
10	Research On Plant NcRNA Interactions Based On Ensemble Learning