Protein sequence classification with Bayesian supervised and semi-supervised learned classifiers

Posted on: 2009-02-03
Degree: Ph.D.
Type: Dissertation
University: State University of New York at Albany
Candidate: King, Brian R
Full Text: PDF
GTID: 1448390002492409
Subject: Biology
Abstract/Summary:
Bioinformatics, an interdisciplinary field between computer science and biology, has emerged primarily out of the need for methods to automate the analysis and annotation of newly discovered biological data. In the last decade, gene and protein sequence repositories have grown exponentially as a result of rapid advances in high-throughput experimental techniques. This has produced a huge inventory of genes with unknown function. Gene function is executed primarily at the protein level; hence, understanding the functional role of proteins in a species can yield substantial biological information about that species, with potential applications in biomedical research. Unfortunately, experimental characterization of protein function is tedious and not feasible for all genes. Computational methods can instead complement experimental efforts in annotating the vast amount of data that lacks functional and/or structural characterization.

The computational fields of machine learning and data mining continue to provide a much-needed framework for numerous methods that can assist in categorizing these unannotated proteins. Supervised learning methods have played a dominant role in helping us better understand some newly discovered proteins with respect to their functional and/or structural characterizations. As with all supervised learning methods, all training data used for model induction must be labeled.

Semi-supervised learning methods, which learn from both labeled and unlabeled data, can have a significant impact on this field of research because of the relatively large amount of unlabeled data available. In theory, this unlabeled data represents a large pool of untapped information that can be used to improve models based on labeled data. Unlike supervised methods, semi-supervised methods are only beginning to emerge in this field, and are often developed with restrictive requirements that make them unsuitable for analysis and characterization of large-scale sequence data.

This dissertation explores the development of supervised and semi-supervised classification methods designed to classify large-scale protein sequence data. A central aim of this research is to develop methods that require only the protein sequence for classification. The majority of the work is based on the well-known Naive Bayes classification framework, which has been shown to perform well in the field of text classification. The parameterized, probabilistic model is built by observing occurrences of fixed-length subsequences throughout the labeled data. Unlabeled data is then used to improve the model by extending the method with the Expectation-Maximization algorithm.

On the task of predicting the subcellular localization of a protein sequence, the supervised method shows superior performance over existing methods. Moreover, the subcellular proteomes of numerous eukaryotic and prokaryotic species are estimated with far greater coverage than any other method known at the time of this research. Results from the semi-supervised learning research show that large repositories of unlabeled protein sequence data can indeed be used to improve predictive performance, particularly when few labeled protein sequences are available and/or the data are highly unbalanced. This dissertation lays a foundation for exploring numerous other characterizations of proteins on large-scale data.
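The approach summarized above (a Naive Bayes model over fixed-length subsequences, refined with Expectation-Maximization on unlabeled sequences) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the function names, the subsequence length k=3, the Laplace smoothing constant, the fixed iteration count, and the soft-label EM scheme borrowed from text classification are all assumptions made for illustration.

```python
import math
from collections import Counter

def kmers(seq, k=3):
    """Return all overlapping length-k subsequences of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def fit(counts, weights, classes, vocab, alpha=1.0):
    """M-step: estimate log priors and log k-mer likelihoods from
    (possibly fractional) class memberships, with Laplace smoothing (assumed)."""
    prior = {c: alpha for c in classes}
    mass = {c: Counter() for c in classes}
    for doc, w in zip(counts, weights):
        for c in classes:
            prior[c] += w[c]
            for km, n in doc.items():
                mass[c][km] += w[c] * n
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(mass[c].values()) + alpha * len(vocab)
        log_like[c] = {km: math.log((mass[c][km] + alpha) / denom) for km in vocab}
    return log_prior, log_like

def posterior(doc, model, classes):
    """E-step for one sequence: P(class | k-mer counts) under the model."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c]
        + sum(n * log_like[c][km] for km, n in doc.items() if km in log_like[c])
        for c in classes
    }
    top = max(scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - top) for s in scores.values())
    return {c: math.exp(s - top) / z for c, s in scores.items()}

def train_semi_supervised(labeled, unlabeled, k=3, iters=10):
    """Fit on labeled data, then alternate E- and M-steps over all data."""
    classes = sorted({y for _, y in labeled})
    lab_counts = [Counter(kmers(s, k)) for s, _ in labeled]
    unl_counts = [Counter(kmers(s, k)) for s in unlabeled]
    vocab = set().union(*lab_counts, *unl_counts)
    # Labeled sequences keep hard (0/1) class memberships throughout.
    lab_w = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = fit(lab_counts, lab_w, classes, vocab)
    for _ in range(iters):
        unl_w = [posterior(d, model, classes) for d in unl_counts]
        model = fit(lab_counts + unl_counts, lab_w + unl_w, classes, vocab)
    return model, classes

def predict(seq, model, classes, k=3):
    """Return the most probable class; k-mers unseen in training are ignored."""
    post = posterior(Counter(kmers(seq, k)), model, classes)
    return max(post, key=post.get)
```

As a usage illustration, `train_semi_supervised(labeled, unlabeled)` takes a list of `(sequence, label)` pairs and a list of unlabeled sequences, and the returned model and class list are passed to `predict`. Setting `iters=0` reduces the sketch to the purely supervised classifier, which makes it easy to compare the two settings on the same data.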
Keywords/Search Tags:Protein, Data, Methods, Supervised, Classification, Field