Predicting The Subcellular Localization Of Proteins With Multiple Sites Based On Multiple Features Fusion

Posted on: 2016-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: X M Qu
GTID: 2180330464969116
Subject: Computer Science and Technology
Abstract/Summary:
Protein subcellular localization prediction aims to determine where a protein or gene expression product resides in the cell. Specifically, it uses the information contained in the protein sequence, together with classification and recognition algorithms, to predict the subcellular location or organelle of a protein. It lays a foundation for proteomics research and is closely related to protein function. Proteins are the bearers and executors of all life activities, but a protein can perform its function fully and take part in life activities normally only when it is located at the correct subcellular location. Correct localization is therefore a prerequisite for the organism to function normally and in an orderly way, and predicting it accurately is important for studying the physiological mechanisms of organisms. Knowing where proteins are located also plays a crucial role in studying the pathogenesis of diseases, discovering targeted drugs, and exploring the regularities of life.
In recent years, with the development of experimental technologies and computer science, the number of proteins obtained from high-throughput experiments has grown geometrically. Traditional experimental methods are time-consuming and laborious and can no longer meet research needs, so machine learning methods are used to address the prediction problem. It has also been found that some proteins do not exist in only one subcellular location; they exist in, or move between, two or more subcellular locations.
According to the statistics of DBMLoc, more than thirty percent of proteins have been found to have more than one subcellular location, and such proteins usually have special functions and greater biological significance for the organism. As the number of known multi-site proteins increases, developing faster and more efficient computational methods to predict their locations is becoming more important, and subcellular localization prediction is moving toward multi-site prediction. Predicting the locations of multi-site proteins is a typical multi-label learning problem, which is much more complicated than single-site localization prediction. Researchers have therefore turned their attention to multi-label learning and obtained satisfactory performance. As with single-site prediction, multi-site subcellular localization prediction consists of four steps: (1) construction of the dataset; (2) feature representation of the protein sequence; (3) design and implementation of the classifier; and (4) evaluation of the prediction algorithm.
In this thesis, we use two datasets: dataset s1, used in the construction of Gpos-mPLoc, as the benchmark dataset, and dataset s2, used in establishing Virus-mPLoc. Both datasets contain single-location and multiple-location proteins. The feature extraction methods used are the N-terminal signal, pseudo amino acid composition (PseAAC), physical and chemical composition (PC), amino acid index distribution (AAID), and stereo-chemical properties (SP).
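As a rough illustration of the PseAAC descriptor named above, the sketch below follows the general form of pseudo amino acid composition: 20 normalized composition terms plus `lam` sequence-order correlation terms. It is not the thesis's implementation. The function name `pseaac`, the defaults `lam=20` and `w=0.05`, and the use of a single property (Kyte-Doolittle hydropathy) are assumptions made only for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values. Using this single property is an
# assumption for the sketch; the abstract does not say which properties
# the thesis used.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def _standardize(prop):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    mean = sum(prop.values()) / len(prop)
    sd = (sum((v - mean) ** 2 for v in prop.values()) / len(prop)) ** 0.5
    return {aa: (v - mean) / sd for aa, v in prop.items()}


H = _standardize(KD)


def pseaac(seq, lam=20, w=0.05):
    """Return a (20 + lam)-dimensional PseAAC-style vector for one sequence."""
    residues = [aa for aa in seq.upper() if aa in H]  # drop non-standard symbols
    n = len(residues)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    counts = Counter(residues)
    freqs = [counts[aa] / n for aa in AMINO_ACIDS]
    # theta_k: mean squared property difference between residues k apart
    thetas = [
        sum((H[residues[i]] - H[residues[i + k]]) ** 2 for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


# Arbitrary example string of standard residues, used only to exercise the code.
example_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
vec = pseaac(example_seq)
```

With these assumed defaults, each call yields 40 values that sum to 1, so the composition terms and the correlation terms share one scale.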
In the feature extraction process, we combine the N-terminal sorting signal model with the PseAAC model: the protein sequence is first divided into two parts, and PseAAC is then applied to each part, yielding an 80-dimensional feature vector. The PC, AAID, and SP models are then used to extract 21-, 40-, and 10-dimensional feature vectors, respectively. Different feature fusion schemes are applied to combine these descriptors, producing feature vectors of different dimensions for the two datasets. Using the multi-label k-nearest neighbor algorithm (ML-KNN) with jackknife evaluation, we obtained satisfactory results. ML-KNN outperforms the flexible neural tree.
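The fusion and classification steps described above can be sketched as follows. This is a minimal, pure-Python rendering of the general ML-KNN procedure (label priors plus neighbor-count likelihoods with smoothing `s`) evaluated by the jackknife (leave-one-out) test. It is not the thesis's code. The helper names (`fuse`, `mlknn_fit`, `mlknn_predict`, `jackknife`), serial concatenation as the fusion scheme, and the toy data are all assumptions for illustration.

```python
import math


def fuse(*blocks):
    """Serial fusion: concatenate several feature vectors of one protein."""
    return [v for block in blocks for v in block]


def _neighbor_counts(x, X, Y, k, skip=None):
    """Per label, count how many of the k nearest rows of X carry that label."""
    order = sorted((i for i in range(len(X)) if i != skip),
                   key=lambda i: math.dist(x, X[i]))[:k]
    return [sum(Y[i][l] for i in order) for l in range(len(Y[0]))]


def mlknn_fit(X, Y, k=3, s=1.0):
    """Estimate label priors and neighbor-count likelihoods from training data."""
    n, q = len(X), len(Y[0])
    prior1 = [(s + sum(row[l] for row in Y)) / (2 * s + n) for l in range(q)]
    c1 = [[0] * (k + 1) for _ in range(q)]  # count histograms, label present
    c0 = [[0] * (k + 1) for _ in range(q)]  # count histograms, label absent
    for i in range(n):
        counts = _neighbor_counts(X[i], X, Y, k, skip=i)
        for l in range(q):
            (c1 if Y[i][l] else c0)[l][counts[l]] += 1
    like1 = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in c1]
    like0 = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in c0]
    return {"X": X, "Y": Y, "k": k, "prior1": prior1,
            "like1": like1, "like0": like0}


def mlknn_predict(model, x):
    """Assign each label by comparing the two posterior scores (MAP rule)."""
    counts = _neighbor_counts(x, model["X"], model["Y"], model["k"])
    labels = []
    for l, c in enumerate(counts):
        p1 = model["prior1"][l] * model["like1"][l][c]
        p0 = (1 - model["prior1"][l]) * model["like0"][l][c]
        labels.append(1 if p1 > p0 else 0)
    return labels


def jackknife(X, Y, k=3, s=1.0):
    """Leave-one-out test: predict each sample from a model fit on the rest."""
    preds = []
    for i in range(len(X)):
        model = mlknn_fit(X[:i] + X[i + 1:], Y[:i] + Y[i + 1:], k, s)
        preds.append(mlknn_predict(model, X[i]))
    return preds


# Toy data (hypothetical, not real protein features): two descriptor blocks
# per sample and three well-separated groups with label sets {0}, {1}, {0, 1}.
block_a = [[c + 0.1 * j, c - 0.1 * j] for c in (0.0, 10.0, 20.0) for j in range(6)]
block_b = [[c + 0.05 * j] for c in (0.0, 10.0, 20.0) for j in range(6)]
X = [fuse(a, b) for a, b in zip(block_a, block_b)]
Y = [[1, 0]] * 6 + [[0, 1]] * 6 + [[1, 1]] * 6

predictions = jackknife(X, Y, k=3)
hamming_accuracy = sum(p == t for pr, tr in zip(predictions, Y)
                       for p, t in zip(pr, tr)) / (len(Y) * len(Y[0]))
```

Concatenation leaves each descriptor's scale untouched, so in practice the blocks would likely need normalizing before distances are computed. Whether the thesis did this is not stated in the abstract.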
Keywords/Search Tags:feature extraction, multi-label learning, Pseudo amino acid composition, multi-label K nearest neighbor, evaluation metrics