
Machine Learning Based Protein Subcellular Localization Prediction

Posted on: 2011-11-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Y Mei
GTID: 1100360305497259
Subject: Computer software and theory
Abstract/Summary:
Protein subcellular localization is an important research topic in molecular cell biology and proteomics. It is closely related to protein function, metabolic pathways, signal transduction and biological processes, and plays an important role in basic biological and biomedical research. Predicting protein subcellular localization with computational models is cheap, fast, effective and widely applicable. Through statistical analysis of large amounts of protein data, computational models can identify informative protein features and infer general statistical relationships between those features and subcellular localization patterns. In recent years, the field has made great progress. However, previous predictive models have several disadvantages. First, protein feature information is not fully mined, so some important aspects of protein information are ignored. Second, data integration models generally concatenate heterogeneous feature spaces or adopt majority-vote ensemble learning, so the importance of each kind of protein feature information is not explicitly evaluated, and the problem of data unavailability is not handled. Finally, previous models perform relatively poorly on unbalanced protein data, protein sub-organelle localization and large-scale protein subcellular localization. This dissertation studies novel predictive methods for protein subcellular localization from the standpoint of machine learning, with the aim of improving predictive performance and giving the models a reasonable biological interpretation. The contributions are summarized as follows: 1.
Introducing multi-instance learning into protein subcellular localization prediction in order to fully exploit previously ignored protein domain information: domain composition, domain boundary partition information and the order of domains along the protein sequence. On the one hand, multi-instance learning captures the local structural information of a protein sequence in terms of its domains; on the other hand, multi-label learning handles proteins with multiple subcellular locations, which opens a new approach to protein subcellular localization prediction. The proposed multi-instance learning method uses a bag-instance representation to describe the whole-part relation between a protein sequence and its domains, thus effectively exploiting the local structural information of the sequence. Experiments on Gram-positive bacterial protein data show that the sequence-based multi-instance learning method achieves performance equivalent to that of the gene ontology based k-NN ensemble learning model. 2. Proposing a spectrum kernel, SpectrumKernel+, that incorporates multiple amino acid classifications into the k-mer feature representation in order to simulate multiple sequence motif evolution patterns. SpectrumKernel+ interprets the biological implication of incorporating amino acid classification information into the k-mer feature representation in terms of physicochemical constraints on protein sequence evolution, and relates the approach to the classical spectrum kernel and the (k,l) mismatch kernel, giving the model a more reasonable biological meaning and an intuitive biological interpretation. SpectrumKernel+ uses multiple amino acid classifications to measure the difference between two sequences' motif evolution patterns and motif distributions, and on that basis defines the similarity between two protein sequences more accurately.
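The idea behind contribution 2 can be illustrated with a minimal sketch: a k-mer spectrum kernel evaluated over several reduced amino acid alphabets and then averaged. The abstract does not give the exact definition of SpectrumKernel+, so the groupings in `GROUPINGS`, the cosine normalization, the plain averaging and all function names below are assumptions chosen for illustration only.

```python
from collections import Counter
from math import sqrt

# Hypothetical amino acid classifications (illustrative, not taken from the thesis).
GROUPINGS = {
    # Identity mapping: reproduces the classical spectrum kernel.
    "identity": {aa: aa for aa in "ACDEFGHIKLMNPQRSTVWY"},
    # Coarse three-class grouping by hydropathy/charge (an assumption).
    "hydropathy": {
        **{aa: "h" for aa in "AVLIMFWC"},  # hydrophobic
        **{aa: "n" for aa in "GPSTYHQN"},  # neutral / polar
        **{aa: "c" for aa in "DEKR"},      # charged
    },
}


def kmer_counts(seq, k, mapping):
    """Count k-mers of a sequence after mapping residues to class symbols."""
    reduced = "".join(mapping.get(aa, "x") for aa in seq)  # "x" = unknown residue
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


def spectrum_kernel(a, b, k, mapping):
    """Inner product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k, mapping), kmer_counts(b, k, mapping)
    return sum(ca[m] * cb[m] for m in ca.keys() & cb.keys())


def normalized_kernel(a, b, k, mapping):
    """Cosine-normalized spectrum kernel; 0.0 if a sequence is shorter than k."""
    kaa = spectrum_kernel(a, a, k, mapping)
    kbb = spectrum_kernel(b, b, k, mapping)
    if kaa == 0 or kbb == 0:
        return 0.0
    return spectrum_kernel(a, b, k, mapping) / sqrt(kaa * kbb)


def combined_kernel(a, b, k=3, groupings=GROUPINGS):
    """Average the normalized kernel over all amino acid classifications."""
    values = [normalized_kernel(a, b, k, m) for m in groupings.values()]
    return sum(values) / len(values)
```

In this sketch, `MKV` and `MRV` share no exact 3-mer, so the identity kernel scores them 0, while the reduced alphabet treats the K to R substitution as conservative and scores them 1. That is one way the physicochemical constraint described above can enter a similarity measure.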
Compared with general protein subcellular localization prediction, protein subnuclear localization prediction appears more challenging. Experiments on two subnuclear protein datasets show that SpectrumKernel+ outperforms the baseline models. 3. Proposing a fused multi-instance kernel, HoMIKernel+, that incorporates the fine-grained information of full homologous sequences. Because of evolutionary conservation and divergence, homologous sequences are only a vague descriptor of the target protein's subcellular localization pattern. This vagueness matches the vagueness with which positive instances describe the object label in the multi-instance learning scenario; this correspondence between biological meaning and multi-instance learning is the starting point for HoMIKernel+. HoMIKernel+ describes the target protein with the k-mer feature representation of its homology set, so that the motif distribution of the target protein is enhanced and noise is suppressed. Experiments on one prokaryotic dataset and three eukaryotic datasets show that HoMIKernel+ outperforms the baseline models, that incorporating homology improves predictive performance, and that fusing multiple multi-instance kernels significantly increases predictive accuracy. 4. Proposing two machine learning models, a homology based knowledge transfer learning model and a statistical correlation based knowledge transfer learning model, together with a simple non-parametric cross-validation method for estimating the weight distribution of the linear kernel combination. These achieve knowledge sharing between homologous and statistically correlated proteins and reduce the time and space complexity of kernel weight estimation.
The relatedness between the target protein and the auxiliary proteins is derived from intuitive biological meaning, and on that basis the gene ontology knowledge of homologous and statistically correlated proteins is transferred to the target protein. A multiple kernel learning system is built on the transferred knowledge for protein subcellular localization prediction. Homology based knowledge transfer has the following advantages: it enriches the gene ontology knowledge about the target protein and overcomes data unavailability for novel proteins and proteins with little biological evidence. Statistical correlation based knowledge transfer has the following advantages: it enriches protein gene ontology knowledge, tunes the weight distribution among the three aspects of gene ontology, incorporates gene ontology semantic distance, adjusts gene ontology term coverage, reduces the missing rate of test gene ontology terms, and avoids retraining the model to predict novel proteins, among others. The kernel weight estimation takes the Matthews correlation coefficient (MCC) into account as a measure of performance bias, so as to perform better on large-scale unbalanced protein data. Experiments on 8 benchmark datasets show that the homology based and statistical correlation based knowledge transfer learning models significantly improve the performance of protein subcellular localization prediction, reduce to a certain degree the unfavorable impact of noise and outliers that gene ontology knowledge transfer may introduce, overcome the performance bias towards large subcellular locations, and perform well on large-scale unbalanced protein data.
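The abstract does not spell out how the kernel weights in contribution 4 are computed. As a rough illustration of the general idea only, the sketch below scores each candidate kernel by a cross-validated MCC and uses the normalized scores as weights in a linear kernel combination. The binary MCC (the thesis deals with multi-class data), the clipping of negative scores to zero, and the names `mcc`, `kernel_weights` and `combine_kernels` are all assumptions, not the author's method.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Binary Matthews correlation coefficient for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def kernel_weights(cv_mcc_scores):
    """Turn per-kernel cross-validated MCC scores into non-negative weights summing to 1."""
    clipped = [max(s, 0.0) for s in cv_mcc_scores]  # assumption: drop kernels with MCC <= 0
    total = sum(clipped)
    if total == 0:
        # No kernel is better than chance: fall back to uniform weights.
        return [1.0 / len(clipped)] * len(clipped)
    return [s / total for s in clipped]


def combine_kernels(kernels, weights):
    """Weighted sum of square kernel matrices given as nested lists."""
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]
```

Because each weight comes from an independent cross-validation score rather than from a joint optimization, a scheme of this kind needs only one pass per kernel. That is consistent with the stated goal of reducing the time and space cost of kernel weight estimation, although the procedure actually used in the thesis may differ.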
Keywords/Search Tags: protein subcellular localization, machine learning, multiple kernel learning, multi-instance learning, transfer learning