Research On Protein Folds Prediction Algorithm Based On Machine Learning

Posted on:2011-03-13

Degree:Master

Type:Thesis

Country:China

Candidate:R F Wang

GTID:2178330332464425

Subject:Signal and Information Processing

Abstract/Summary:

Proteins are composed of amino acid sequences. Only after an amino acid sequence folds into its spatial structure does the protein acquire its biological activity and function. Research shows that the number of natural protein folds is limited, ranging from several hundred to about a thousand. Studying these folds systematically and developing effective prediction algorithms helps to uncover the principles of protein folding and also provides a reference for the accurate experimental determination of protein structure.

Protein fold prediction methods fall into two kinds: homology modeling methods and taxonomic methods. Although homology modeling methods are effective when sequence similarity is high, they can predict only a rough fold pattern, and their reliability drops sharply as sequence similarity decreases. Taxonomic methods do not rely on sequence similarity, and they can correctly recognize the fold pattern even for distantly homologous proteins. In essence, taxonomic methods use machine learning techniques to predict protein folds from features extracted from the primary sequence.

This thesis summarizes the general steps for applying machine learning techniques to protein fold prediction: feature extraction, optimized combination of feature vectors, selection of the base classifier, fold prediction, and performance evaluation. For the optimized combination of feature vectors, existing methods use a "one-by-one adding" strategy, which has many drawbacks; in particular, it cannot find the optimal combination of feature vectors. We instead use a genetic algorithm to optimize the combination. It not only overcomes these shortcomings but also computes a weight for each feature vector, which can be used to assess the merit of that feature.
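The idea of letting a genetic algorithm assign a weight to each feature can be illustrated with a small sketch. This is not the thesis's implementation: the fitness function (leave-one-out 1-nearest-neighbour accuracy), the selection, crossover and mutation settings, the toy data, and all names (`weighted_accuracy`, `evolve_weights`, `make_toy_data`) are assumptions chosen only to keep the example dependency-free. Each column stands in for one feature vector, and each chromosome holds one weight per column.

```python
import random

def weighted_accuracy(weights, samples, labels):
    """Leave-one-out 1-nearest-neighbour accuracy under a weighted squared distance."""
    correct = 0
    for i, x in enumerate(samples):
        best_j, best_d = None, float("inf")
        for j, y in enumerate(samples):
            if i == j:
                continue
            d = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(samples)

def evolve_weights(samples, labels, n_features, pop_size=20, generations=30, seed=0):
    """Search for per-feature weights in [0, 1]; requires n_features >= 2."""
    rng = random.Random(seed)

    def fitness(w):
        return weighted_accuracy(w, samples, labels)

    # Random chromosomes plus the "all features weighted equally" baseline.
    population = [[rng.random() for _ in range(n_features)] for _ in range(pop_size - 1)]
    population.append([1.0] * n_features)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]        # truncation selection, keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # single-point crossover
            child = a[:cut] + b[cut:]
            k = rng.randrange(n_features)        # Gaussian mutation of one gene, clamped
            child[k] = min(1.0, max(0.0, child[k] + rng.gauss(0, 0.2)))
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best)

def make_toy_data(seed=1):
    """Hypothetical data: column 0 separates the two labels, column 1 is noise."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label in (0, 1):
        for _ in range(15):
            samples.append([label + rng.gauss(0, 0.2), rng.gauss(0, 3.0)])
            labels.append(label)
    return samples, labels

if __name__ == "__main__":
    samples, labels = make_toy_data()
    weights, score = evolve_weights(samples, labels, n_features=2)
    print("weights:", weights, "accuracy:", score)
```

Because the equal-weight chromosome is in the initial population and the best half always survives, the returned accuracy can never fall below the equal-weight baseline. The evolved weights can then be read as a rough indication of how much each feature contributes.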
In addition, for performance evaluation, we analyze the generalization ability for practical applications using ROC curves, alongside sensitivity and overall accuracy.

Based on SCOP's hierarchical structure, a multi-layered prediction architecture built on random forests, named MLPA-RF, is proposed. The revised feature representation combines an amino acid composition vector based on evolutionary information with secondary structure predicted by PredictProtein. Experiments on common data sets show that our method achieves much higher accuracy, lower complexity, and stronger generalization ability than existing methods. In addition, MLPA is easy to extend, since new classification algorithms can be embedded directly, and it is easy to port. Our method thus provides new ideas for protein fold prediction.
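A layered architecture with a pluggable base classifier might be organised as in the sketch below. The abstract does not spell out the layers, so this assumes two of them: the first predicts the SCOP structural class and the second predicts the fold among the folds of that class. To stay dependency-free, a nearest-centroid classifier stands in for the random forest; any object offering the same `fit` and `predict_one` methods could be passed in instead. The class names, the `class_of` helper, and the two-dimensional toy feature vectors are all hypothetical.

```python
from collections import defaultdict

class NearestCentroid:
    """Stand-in base classifier (not a random forest); same interface could wrap one."""
    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[label] += 1
        self.centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        return self

    def predict_one(self, x):
        return min(
            self.centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, self.centroids[c])),
        )

def class_of(fold):
    """Assumes SCOP-style fold ids such as 'a.1', where the prefix is the class."""
    return fold.split(".")[0]

class TwoLayerPredictor:
    def __init__(self, make_classifier=NearestCentroid):
        self.make_classifier = make_classifier   # factory for the embedded classifier

    def fit(self, X, folds):
        classes = [class_of(f) for f in folds]
        # Layer 1: one model that separates structural classes.
        self.class_model = self.make_classifier().fit(X, classes)
        # Layer 2: one model per class, trained only on that class's folds.
        grouped = defaultdict(lambda: ([], []))
        for x, f, c in zip(X, folds, classes):
            grouped[c][0].append(x)
            grouped[c][1].append(f)
        self.fold_models = {
            c: self.make_classifier().fit(xs, fs) for c, (xs, fs) in grouped.items()
        }
        return self

    def predict_one(self, x):
        c = self.class_model.predict_one(x)
        return c, self.fold_models[c].predict_one(x)

# Toy training set: made-up feature vectors with SCOP-style fold labels.
TOY_X = [[0.0, 0.0], [0.2, 0.1], [0.0, 1.0], [0.1, 1.2],
         [5.0, 5.0], [5.1, 4.9], [5.0, 6.0], [5.2, 6.1]]
TOY_FOLDS = ["a.1", "a.1", "a.3", "a.3", "b.1", "b.1", "b.2", "b.2"]

if __name__ == "__main__":
    model = TwoLayerPredictor().fit(TOY_X, TOY_FOLDS)
    print(model.predict_one([0.1, 0.9]))
```

Passing a different factory as `make_classifier` swaps the embedded algorithm without touching the layering logic, which is the kind of extensibility the abstract attributes to MLPA.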

Keywords/Search Tags:

Protein folds prediction, Machine learning, Genetic algorithm, MLPA-RF

Related items

1	Research Of Hotspot Prediction At Protein-Protein Interfaces Using Machine Learning
2	Research On Prediction Of Protein-protein Interactions Based On SIFT Algorithm And Parallel Support Vector Machine
3	Research On Hierarchical Classification Based On Conceptual Semantic Hierarchy
4	Study Of Protein Function Prediction Using Semi-supervised Learning
5	Research On Technology Of Computational Biology For Protein Structure Prediction
6	ECPF:An Efficient Algorithm For Expanding Clustered Protein Families
7	Research Of Data Mining Technology Apply In The Protein Secondary Structure Prediction
8	Study On Some Information Extraction Algorithms In Protein Subcellular Localization Prediction
9	Hybrid Algorithms For Protein Structure Prediction Problems Of AB Off-lattice Model
10	Research Of Imbalanced Dataset And Application In Prediction Of Protein-Protein Interaction Sites