
Research On Protein Subcellular Localization Prediction Under Multi-label Setting

Posted on: 2014-11-03  Degree: Master  Type: Thesis
Country: China  Candidate: C Y Sun  Full Text: PDF
GTID: 2250330401469208  Subject: Computer application technology
Abstract/Summary:
Protein subcellular localization prediction uses a protein's sequence information to predict which cell organelles or cell domains (subcellular locations such as the nucleus, mitochondria, cytoplasm, and cytomembrane) the protein resides in. It is a foundation of proteomics and protein function research. Previous research mainly treated protein subcellular localization as a multi-class problem. However, biological experiments show that a protein may exist in one or more subcellular locations, so the task is more properly a multi-label classification problem. This thesis studies protein subcellular localization under the multi-label setting.
Our study consists of two steps: constructing protein sequence features, and predicting protein subcellular locations with multi-label classification algorithms.
In the first step, we first discuss amino acid composition, the basic physicochemical characteristics of amino acids, and gene ontology (GO) as ways to represent protein sequence features. We then propose a procedure for constructing protein sequence features and elaborate the construction methods, including the processing of the original sequence and the construction of the Pse-AAC model and the GO model. Using different feature combinations, we construct seven feature representation datasets. Each contains sub-datasets for six species groups: virus, plant, gram-negative, gram-positive, human, and eukaryote. In total, we obtain 42 independent datasets.
In the second step, which compares algorithms for protein subcellular localization, we conduct experiments with four multi-label classification algorithms: OVR-κNN, ML-κNN, Rank-SVM, and SVM-ML. First, we select the optimal parameters for each dataset by grid search (3-fold cross-validation) based on the recall measure. Then, we use these optimal parameters to obtain the experimental results (10-fold cross-validation).
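As an illustration of the simplest feature family named above, the following is a minimal sketch of plain amino acid composition together with a binary multi-label encoding of locations. The abstract does not give the thesis's exact Pse-AAC or GO construction, so this covers only the basic composition vector; the function names `amino_acid_composition` and `encode_labels` and the example location names are hypothetical.

```python
from collections import Counter

# The 20 standard amino acid one-letter codes (fixed order defines the feature order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies.

    Non-standard characters (e.g. 'X') are ignored; this is an assumption,
    not a rule taken from the thesis.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def encode_labels(locations, all_locations):
    """Encode a protein's set of locations as a binary indicator vector.

    A multi-label target may contain more than one 1, unlike a
    multi-class target, which has exactly one.
    """
    present = set(locations)
    return [1 if loc in present else 0 for loc in all_locations]


if __name__ == "__main__":
    features = amino_acid_composition("AACD")
    target = encode_labels(
        ["nucleus", "cytoplasm"],
        ["cytoplasm", "mitochondrion", "nucleus"],
    )
    print(features[:4], target)
```

A protein annotated with two locations yields a target with two 1s, which is what separates the multi-label setting from the multi-class one.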
The results show that: (a) using the Pse-AAC model to repair the binary GO model is the most effective of the seven feature representation methods; (b) the OVR-κNN algorithm is faster than the other three algorithms; (c) the SVM-based algorithms achieve better experimental performance than the κNN-based algorithms; (d) the recall of the SVM-ML algorithm is better than that of the Rank-SVM algorithm and of the κNN-based algorithms. Finally, we compare our experimental results with existing methods. Our method performs better than the Cell-Ploc and Cell-Ploc 2.0 methods and similarly to the mGOASVM method.
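Because recall drives both the parameter selection and the comparison above, a small sketch of one common multi-label recall may help. The abstract does not say which definition the thesis uses, so the example-based form here (the fraction of a protein's true locations that were predicted, averaged over proteins) is an assumption, and `example_based_recall` is a hypothetical name.

```python
def example_based_recall(true_sets, predicted_sets):
    """Average, over proteins, of |true ∩ predicted| / |true|.

    Proteins with no annotated location are skipped (an assumption made
    here to avoid division by zero).
    """
    if len(true_sets) != len(predicted_sets):
        raise ValueError("true_sets and predicted_sets must have equal length")
    scores = []
    for truth, predicted in zip(true_sets, predicted_sets):
        truth, predicted = set(truth), set(predicted)
        if not truth:
            continue
        scores.append(len(truth & predicted) / len(truth))
    if not scores:
        raise ValueError("no protein has an annotated location")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = [{"nucleus", "cytoplasm"}, {"mitochondrion"}]
    predicted = [{"nucleus"}, {"mitochondrion", "nucleus"}]
    # First protein: 1 of 2 true locations found; second: 1 of 1 found.
    print(example_based_recall(truth, predicted))  # 0.75
```

Under this definition, predicting extra locations does not lower recall, so it is usually read alongside precision.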
Keywords/Search Tags: protein subcellular localization, sequence feature, gene ontology, support vector machine, multi-label classification algorithm