
Research On Methods For Multiplex Protein Subcellular Localization Prediction Based On Machine Learning

Posted on: 2014-12-09  Degree: Doctor  Type: Dissertation
Country: China  Candidate: J Z Cao  Full Text: PDF
GTID: 1260330425977363  Subject: Control theory and control engineering
Abstract/Summary:
Protein subcellular localization information is important for deducing protein function, understanding cellular activity, identifying drug targets, and revealing disease pathogenesis. Over the last ten years the number of protein sequences has grown rapidly, and protein subcellular localization prediction based on intelligent computation has become a focus of systems biology and bioinformatics. This dissertation applies machine learning to problems in multiplex protein subcellular localization prediction. The main contents are as follows:

1. An imbalance-weighted multi-label K-nearest neighbor algorithm is proposed to address the imbalanced distribution of protein data. The algorithm estimates the posterior probabilities of the subcellular locations of unseen samples from the statistics of their nearest neighbors, assigns imbalance weights to the samples in each class according to the class distribution, and builds its decision function on the maximum a posteriori principle combined with those weights. Results on several imbalanced protein datasets indicate that the proposed algorithm outperforms two popular algorithms for multiplex protein subcellular localization prediction, Cell-mPLoc 2.0 and iLoc-Cell, and that it effectively reduces the impact of data imbalance on prediction.

2. A method for constructing training datasets by mining information from proteins with non-experimental annotations is proposed to address the shortage of protein training data. The method introduces proteins with non-experimental annotations, evaluates them with an active learning strategy, and adds the most valuable samples to the original training dataset to obtain a new, more informative training dataset. Experiments on several datasets show that the performance of four classifiers (IMKNN, SVM, Gaussian process and ML-RBF) improves, and that the bottleneck of training data shortage can be effectively overcome.

3. An ensemble prediction method based on preliminary protein identification is proposed, because a single classifier rarely achieves high accuracy on both singleplex and multiplex proteins. The method identifies the type of each unseen protein by transductive learning, and then applies separate classifiers to singleplex and multiplex proteins. Experimental results on several datasets indicate that the proposed method effectively identifies the types of unseen proteins and performs better than two popular algorithms for multiplex protein subcellular localization prediction, Cell-mPLoc 2.0 and iLoc-Cell.
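
As a rough illustration of the imbalance-weighted multi-label K-nearest neighbor idea in item 1, the sketch below follows the standard ML-KNN recipe (neighbor label counts, smoothed priors and likelihoods, a maximum a posteriori decision) and multiplies the positive-label side of the decision by a per-label weight. The abstract does not give the dissertation's actual weighting formula, so the weight used here, (negatives / positives) ** alpha, is purely an assumption, as are the function names fit_weighted_mlknn and predict_weighted_mlknn and the parameters k, s and alpha.

```python
import numpy as np

def knn_indices(X_train, X_query, k, exclude_self=False):
    """Indices of the k nearest training rows (Euclidean) for each query row."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    if exclude_self:
        # Only valid when X_query is X_train: a sample is not its own neighbor.
        np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def fit_weighted_mlknn(X, Y, k=3, s=1.0, alpha=0.5):
    """Hypothetical sketch. X: (n, d) features, Y: (n, q) binary label matrix."""
    Y = np.asarray(Y, dtype=int)
    n, q = Y.shape
    # Smoothed prior probability that each label is present.
    prior1 = (s + Y.sum(axis=0)) / (2 * s + n)
    # For every training sample, count neighbors carrying each label.
    nbrs = knn_indices(X, X, k, exclude_self=True)
    counts = Y[nbrs].sum(axis=1)                      # shape (n, q)
    c1 = np.zeros((q, k + 1))
    c0 = np.zeros((q, k + 1))
    for l in range(q):
        for j in range(k + 1):
            c1[l, j] = np.sum((Y[:, l] == 1) & (counts[:, l] == j))
            c0[l, j] = np.sum((Y[:, l] == 0) & (counts[:, l] == j))
    # Smoothed likelihood of seeing j labeled neighbors given label present/absent.
    like1 = (s + c1) / (s * (k + 1) + c1.sum(axis=1, keepdims=True))
    like0 = (s + c0) / (s * (k + 1) + c0.sum(axis=1, keepdims=True))
    # ASSUMED imbalance weight (not from the dissertation): rare labels get > 1.
    n_pos = Y.sum(axis=0)
    weight = ((n - n_pos + s) / (n_pos + s)) ** alpha
    return {"X": np.asarray(X, dtype=float), "Y": Y, "k": k,
            "prior1": prior1, "like1": like1, "like0": like0, "weight": weight}

def predict_weighted_mlknn(model, X_query):
    """Weighted MAP decision: label l is predicted when w_l * P(H1)P(E|H1) > P(H0)P(E|H0)."""
    nbrs = knn_indices(model["X"], np.asarray(X_query, dtype=float), model["k"])
    counts = model["Y"][nbrs].sum(axis=1)             # shape (m, q)
    labels = np.arange(model["Y"].shape[1])
    p1 = model["prior1"] * model["like1"][labels, counts]
    p0 = (1.0 - model["prior1"]) * model["like0"][labels, counts]
    return (model["weight"] * p1 > p0).astype(int)

if __name__ == "__main__":
    # Toy data: two well-separated clusters, one location label each.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
    Y = np.array([[1, 0]] * 4 + [[0, 1]] * 4)
    model = fit_weighted_mlknn(X, Y, k=3)
    print(predict_weighted_mlknn(model, [[0.5, 0.5], [10.5, 10.5]]))
```

With all weights equal to 1 this reduces to plain ML-KNN; the weight only shifts the decision threshold toward labels that are rare in the training set, which is one plausible way to read "imbalance weightings" but should not be taken as the author's formulation.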
Keywords/Search Tags: Protein Subcellular Localization Prediction, Multiplex Protein, Imbalanced Data Learning, Active Learning, Transductive Learning