The Research On Protein Function Prediction Technology Based On Classification

Posted on: 2011-06-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y M Chen
GTID: 1118330341451760
Subject: Computer Science and Technology
Abstract/Summary:
Proteins are macromolecules that serve as building blocks and functional components of a cell, and they account for the second largest fraction of cellular weight after water. They are responsible for some of the most important functions in an organism, including forming the organs, catalyzing the biochemical reactions necessary for metabolism, and maintaining the cellular environment. Proteins are the most essential and versatile macromolecules of life, and knowledge of their functions is a crucial link in the development of new drugs, better crops, and even synthetic biochemicals such as biofuels.
Predicting protein functions through biological experiments is costly and time-consuming, and cannot keep pace with the needs of the contemporary life sciences. Modern high-throughput molecular biology experiments generate vast amounts of data, which makes computational approaches to protein function prediction very important. Modern life science research has become a data-driven discipline.
This dissertation focuses on protein function prediction based on machine-learning classification techniques. Each protein is modeled as a point in property space or described explicitly as an attribute vector; alternatively, a similarity kernel between pairs of proteins is computed. For each protein, the functional annotation indicates which functions it has, but it does not tell us which functions it does not have. Protein function prediction learns a classification model from known protein annotations, uses it to predict the functions of unannotated proteins, and provides biologists with a reference for their experiments.
This dissertation uses the support vector machine (SVM) as the base classifier because it is grounded in statistical learning theory and generalizes well. The complexity of the protein property space calls for a nonlinear classifier. A kernel function maps the property space into a high-dimensional feature space, where a linear classifier can be learned.
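The kernel idea described above can be illustrated with a minimal sketch. It is not taken from the dissertation: the degree-2 polynomial kernel, the feature map `phi`, and the XOR-like toy data are assumptions chosen only to show that a kernel equals an inner product in a mapped space, and that classes which are not linearly separable in the original space can become so after the mapping.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

# XOR-like toy data: no straight line separates the classes in 2-D.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]

# The kernel value equals the inner product of the mapped points,
# so the mapping never has to be computed explicitly by a kernel method.
for x in points:
    for y in points:
        assert math.isclose(poly2_kernel(x, y), dot(phi(x), phi(y)), abs_tol=1e-9)

# In the mapped 3-D space a linear rule separates the classes:
# the weight vector w picks out the x1 * x2 coordinate.
w = (0.0, 1.0, 0.0)
predictions = [1 if dot(w, phi(x)) > 0 else -1 for x in points]
print(predictions)
```

An SVM trained with such a kernel only needs the kernel values between pairs of examples, which is why a precomputed protein similarity kernel can stand in for an explicit attribute vector.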
Based on the hierarchy of the Gene Ontology and the annotations of known genes, the dissertation develops a novel kernel matrix. The algorithm proceeds in four phases: computing the semantic similarity between terms, calculating the functional similarity between proteins, constructing a protein similarity graph, and searching for the optimal diffusion kernel. The resulting kernel matrix achieves better ROC performance than several typical kernel matrices.
For many functional classes, treating the proteins assigned to the function as positive examples and all others as negative examples leads to class imbalance. Furthermore, potential positive examples hidden in the negative set also damage the quality of the classifier. The dissertation explores methods for the class imbalance problem and proposes a scheme that first creates synthetic positive examples to enlarge the positive set, then iteratively trains an SVM to extract a representative negative set of appropriate size. Both problems are thereby addressed. Compared with typical approaches to selecting the training set, cross-validation on known genes yields a good F value, and the ROC curve for predictions on unknown genes also shows that the method generalizes well.
Given the characteristics of protein function annotation, the dissertation casts protein function prediction as a semi-supervised classification problem, in which a classifier is learned from a few labeled examples and a large number of unlabeled examples. It examines the theory and methods of semi-supervised classification and proposes a novel approach. First, it enlarges the positive set by clustering on a weighted graph; then it extracts a few negative examples by clustering the probable negative examples; finally, it adapts the well-known tri-training algorithm to learn three classifiers on three views. The prediction is obtained by majority voting.
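The tri-training and majority-voting step mentioned above can be sketched as follows. The abstract does not say how the dissertation adapts tri-training to three views, so this shows only the standard rule of the original algorithm (an unlabeled example is pseudo-labeled for one classifier when the other two agree) together with a plain majority vote. The function names and the toy predictions are hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label chosen by most classifiers (one vote per classifier)."""
    return Counter(votes).most_common(1)[0][0]

def pseudo_labels_for(i, predictions):
    """Standard tri-training rule for classifier i.

    predictions[c][n] is classifier c's label for unlabeled example n.
    Returns {example index: label} for the examples on which the other
    two classifiers agree; these would be added to classifier i's
    training set in the next round.
    """
    j, k = [c for c in range(3) if c != i]
    return {
        n: label_j
        for n, (label_j, label_k) in enumerate(zip(predictions[j], predictions[k]))
        if label_j == label_k
    }

# Hypothetical predictions of three classifiers (one per view) on four
# unlabeled proteins: +1 = "has the function", -1 = "does not".
predictions = [
    [+1, -1, +1, -1],
    [+1, +1, -1, -1],
    [-1, +1, +1, -1],
]

# Final prediction per protein: majority vote over the three classifiers.
final = [majority_vote(column) for column in zip(*predictions)]

# Pseudo-labeled examples that classifiers 1 and 2 would hand to classifier 0.
extra_for_first = pseudo_labels_for(0, predictions)
print(final, extra_for_first)
```

With three binary classifiers a strict majority always exists, so the vote needs no tie-breaking rule.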
Measured by precision p, recall r, and their combination F, experiments show that the semi-supervised method outperforms several classical methods.
Because each protein can belong to multiple functional classes, the dissertation formulates function prediction as a multi-label classification problem. It describes the characteristics of multi-label classification and the common solutions relevant to this study. When there are more than 100 classes, the classic methods incur great computational complexity. The dissertation presents a simple, flexible function prediction framework built on an ensemble of support vector machines with dynamic thresholds. It has two phases: exploiting the class hierarchy to select an appropriate set of training examples for each support vector machine classifier, and predicting protein functions top-down with a dynamic threshold based on the class hierarchy. Flat prediction performance is evaluated with precision p, recall r, and their combination F, and hierarchical accuracy, hierarchical recall, and hierarchical F are introduced to further measure the hierarchical consistency of the predictions. Experiments show that both the training-example selection strategy and the dynamic threshold policy are effective.
Overall, the dissertation presents a thorough study of protein function prediction based on classification techniques and can serve as a reference for machine learning researchers and biologists.
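The flat and hierarchical evaluation measures named in the abstract can be illustrated with a small sketch. The abstract does not give the dissertation's exact definitions, so this assumes the commonly used ancestor-based form: both the predicted and the true label sets are extended with all their ancestors before precision, recall, and F are computed, and "hierarchical accuracy" is read as hierarchical precision. The toy hierarchy `PARENT` and the term names are hypothetical.

```python
# Hypothetical toy hierarchy: child term -> parent term ("root" is the top).
PARENT = {
    "binding": "root",
    "catalysis": "root",
    "ion_binding": "binding",
    "kinase": "catalysis",
}

def with_ancestors(terms):
    """Extend a set of terms with all their ancestors, excluding the root."""
    closed = set()
    for term in terms:
        while term != "root":
            closed.add(term)
            term = PARENT[term]
    return closed

def prf(predicted, true):
    """Precision, recall, and F (their harmonic mean) for two label sets."""
    hits = len(predicted & true)
    p = hits / len(predicted) if predicted else 0.0
    r = hits / len(true) if true else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def hierarchical_prf(predicted, true):
    """Same measures, computed on the ancestor-extended label sets."""
    return prf(with_ancestors(predicted), with_ancestors(true))

true_terms = {"ion_binding"}
predicted_terms = {"binding", "kinase"}

flat = prf(predicted_terms, true_terms)               # exact matches only
hier = hierarchical_prf(predicted_terms, true_terms)  # partial credit via ancestors
print(flat, hier)
```

In this example the flat measures give no credit at all, while the hierarchical measures reward the prediction of "binding", which is an ancestor of the true term. This is the sense in which hierarchical measures capture the consistency of a prediction with the class hierarchy.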
Keywords/Search Tags: protein function prediction, support vector machine, class imbalance, semi-supervised classification, hierarchical multi-label