Research On Protein Folds Prediction Algorithm Based On Machine Learning

Posted on: 2011-03-13
Degree: Master
Type: Thesis
Country: China
Candidate: R F Wang
Full Text: PDF
GTID: 2178330332464425
Subject: Signal and Information Processing
Abstract/Summary:
Proteins are composed of amino acid sequences. Only when an amino acid sequence folds into a spatial structure does the protein acquire its biological activity and function. Research shows that the number of natural protein folds is limited, ranging from several hundred to about a thousand. Studying these folds systematically and developing effective prediction algorithms helps uncover the principles of protein folding and provides a reference for the accurate experimental determination of protein structure.
Protein fold prediction methods fall into two kinds: homology modeling methods and taxonomic methods. Homology modeling methods are efficient when sequence similarity is high, but they can predict only a rough fold pattern, and their reliability drops sharply as sequence similarity decreases. Taxonomic methods do not rely on sequence similarity and can correctly recognize the fold pattern even for distantly homologous proteins. In essence, taxonomic methods use machine learning techniques to predict protein folds from features extracted from the primary sequence.
This paper summarizes the general steps for applying machine learning techniques to protein fold prediction: feature extraction, optimized combination of feature vectors, selection of the base classifier, fold prediction, and performance evaluation. For the optimized combination of feature vectors, existing methods use a "one by one adding" strategy, which has many drawbacks; in particular, it cannot find the optimal combination of feature vectors. We instead use a genetic algorithm for the optimized combination. It not only makes up for these shortcomings but also calculates a weight for each feature vector, which can be used to evaluate the merit of that feature.
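The genetic-algorithm search over feature-vector weights described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function `evolve_weights`, its parameters, the choice of tournament selection, uniform crossover, and Gaussian mutation, and the toy fitness function are all hypothetical. In practice the fitness would more plausibly be something like the cross-validated accuracy of the base classifier on the weighted feature combination.

```python
import random


def evolve_weights(fitness, n_groups, pop_size=30, generations=40,
                   mutation_rate=0.2, seed=0):
    """Search for one weight in [0, 1] per feature group with a simple GA.

    `fitness` is a caller-supplied function (hypothetical here) that maps a
    list of weights to a score to maximize.
    """
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_groups)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation

    def tournament(scored):
        # Return the fitter of two randomly drawn individuals.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    for _ in range(generations):
        scored = sorted(((fitness(w), w) for w in pop),
                        key=lambda pair: pair[0], reverse=True)
        history.append(scored[0][0])
        next_pop = [list(scored[0][1])]  # elitism: keep the best unchanged
        while len(next_pop) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            # Uniform crossover: each gene comes from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(p1, p2)]
            # Gaussian mutation, clipped back into [0, 1].
            child = [min(1.0, max(0.0, g + rng.gauss(0, 0.1)))
                     if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop

    best = max(pop, key=fitness)
    return best, fitness(best), history


def toy_fitness(weights):
    # Toy stand-in for classifier accuracy (an assumption for illustration):
    # group 0 helps a lot, group 1 helps a little, group 2 hurts.
    useful, weak, noisy = weights
    return 0.6 * useful + 0.3 * weak - 0.4 * noisy


best_weights, best_score, history = evolve_weights(toy_fitness, n_groups=3)
print(best_weights, best_score)
```

Because the best individual is copied unchanged into each new generation, the best fitness recorded in `history` never decreases. The final weights can then be read as a rough ranking of how much each feature group contributes under the chosen fitness.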
In addition, for performance evaluation, we analyze the generalization ability for practical application through the ROC curve, besides sensitivity and overall accuracy.
With the help of SCOP's hierarchical structure, a multi-layered predicting architecture based on random forest (named MLPA-RF) is proposed. The revised feature representation combines an amino acid composition vector based on evolutionary information with predicted secondary structure obtained from PredictProtein. Experiments on common data sets show that our method achieves much higher accuracy, lower complexity, and stronger generalization ability than existing methods. In addition, MLPA is easy to extend, since new classification algorithms can be embedded directly, and it is easy to port. Our method thus provides new ideas for protein fold prediction.
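The idea of a multi-layered architecture with an embeddable base classifier can be illustrated with a small two-layer sketch: the first layer predicts the structural class, and the second layer predicts the fold within that class. Everything here is an assumption for illustration: the class names `NearestCentroid` and `TwoLayerPredictor`, the two-layer split, and the toy two-dimensional features are hypothetical, and a nearest-centroid classifier stands in for random forest so that the sketch needs only the standard library. It is not the MLPA-RF implementation.

```python
from collections import defaultdict


class NearestCentroid:
    """Tiny stand-in base classifier (NOT random forest)."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for features, label in zip(X, y):
            groups[label].append(features)
        # One mean vector per label.
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict_one(self, features):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))
        return min(self.centroids,
                   key=lambda label: sq_dist(self.centroids[label]))


class TwoLayerPredictor:
    """Layer 1 predicts the structural class; layer 2 the fold within it."""

    def __init__(self, make_classifier=NearestCentroid):
        # Any factory returning an object with fit/predict_one can be
        # plugged in here, which is what makes the layers extensible.
        self.make_classifier = make_classifier

    def fit(self, X, classes, folds):
        self.class_model = self.make_classifier().fit(X, classes)
        by_class = defaultdict(lambda: ([], []))
        for features, cls, fold in zip(X, classes, folds):
            by_class[cls][0].append(features)
            by_class[cls][1].append(fold)
        # One fold-level model per structural class.
        self.fold_models = {
            cls: self.make_classifier().fit(Xc, yc)
            for cls, (Xc, yc) in by_class.items()
        }
        return self

    def predict_one(self, features):
        cls = self.class_model.predict_one(features)
        return cls, self.fold_models[cls].predict_one(features)


# Toy data (made up for illustration): 2-D feature vectors with SCOP-style
# class and fold labels.
X = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.8], [0.8, 0.9],
     [0.1, 0.1], [0.2, 0.2], [0.1, 0.9], [0.2, 0.8]]
classes = ["alpha", "alpha", "alpha", "alpha", "beta", "beta", "beta", "beta"]
folds = ["a.1", "a.1", "a.4", "a.4", "b.1", "b.1", "b.34", "b.34"]

predictor = TwoLayerPredictor().fit(X, classes, folds)
print(predictor.predict_one([0.85, 0.15]))
```

Passing a different factory as `make_classifier` swaps the base classifier in every layer without touching the routing logic, which is one way to read the claim that new classification algorithms can be embedded directly.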
Keywords/Search Tags: Protein folds prediction, Machine learning, Genetic algorithm, MLPA-RF