Ensemble Learning Based Prediction Of Protein Subcellular Location

Posted on: 2012-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Liu
GTID: 2178330335479720
Subject: Computer software and theory

Abstract/Summary:
With the successful completion of the Human Genome Project and the rapid progress of modern molecular biology, a vast amount of biological data has accumulated, bringing us into the post-genome era. Because these data are so large and their relationships so complex, computers are needed to store and process them. This need gave rise to bioinformatics, whose purpose is to uncover the biological meaning contained in these data by acquiring, processing, storing, retrieving, and analyzing them. Early studies showed that the more similar two protein sequences are, the more likely the proteins share the same subcellular location. Homologous proteins tend to have similar locations because they have similar sequences and similar biological functions. Meanwhile, the gap between the number of protein sequences in public databases and the number with functional annotation keeps widening, and discovering biological rules by experiment is time-consuming and costly. There is therefore high demand for methods that accurately predict protein structure, subcellular location, and protein-protein interactions directly from sequence.

Prediction of protein subcellular and subnuclear location is an important task in the post-genome era. Through gene expression, genetic information is passed on and proteins are synthesized on ribosomes (in the cytoplasm in bacteria). After synthesis, a protein must be transported to its native compartment in order to carry out its biological function and keep the organism working properly. If proteins end up in the wrong compartments, the functions of the cell and the organism are seriously impaired.

Building on previous research, this thesis proposes using Evolutionary Fuzzy K-Nearest Neighbor (EFKNN) and its ensemble to predict five subcellular locations of gram-negative bacterial proteins and nine subnuclear locations of eukaryotic proteins.
The K-Nearest Neighbor (KNN) algorithm finds the K samples (K is set in advance) that lie closest to the test sample under some distance metric, and assigns the test sample the label held by the majority of those neighbors. The Fuzzy K-Nearest Neighbor (FKNN) algorithm, proposed by Keller et al. in 1985, is similar but differs in one respect: during classification it gives the sample a membership degree for each class. The larger the membership degree, the more likely the sample belongs to that class. Introducing fuzzy theory into KNN greatly reduces the impact of data imbalance and improves accuracy.

In our study, ensemble learning is introduced into protein subcellular location prediction to improve accuracy and generalization. Previous research shows that an ensemble is effective only when its base learners are both accurate and clearly different from one another. With this in mind, on the one hand, we chose the pseudo amino acid composition (PseAA) model to extract features from the primary protein sequences of the gram-negative dataset as classifier input. Using the evolutionary fuzzy k-nearest neighbor algorithm (EFKNN), we trained six base classifiers, each with a different k value, a parameter that plays an important role in training and classification. To integrate the outputs of the six base classifiers, we propose a novel ensemble approach named accumulative vote quantity (AVQ), which achieves good accuracy. On the other hand, to further improve accuracy, make the base learners more diverse, and represent protein sequences more faithfully, we use amino acid composition, physicochemical properties, composition, PseAA, and quasi-sequence-order to extract features from the protein sequences in the SNL9 dataset.
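As a rough illustration of the membership idea described above, the following Python sketch computes fuzzy KNN class memberships with inverse-distance weighting, in the style of Keller's formulation. It assumes crisp training labels, Euclidean distance, and a fuzzifier `m = 2`. The function names, toy coordinates, and location labels are hypothetical, and the evolutionary parameter tuning of EFKNN is not shown.

```python
import math


def fuzzy_knn_memberships(train_x, train_y, query, k=3, m=2.0, eps=1e-12):
    """Return {class_label: membership degree} for `query`.

    Assumes crisp training labels; m > 1 is the fuzzifier.
    """
    # Distance from the query to every training sample, nearest first.
    scored = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    neighbors = scored[:k]

    # Closer neighbors get larger weights: 1 / d^(2 / (m - 1)).
    exponent = 2.0 / (m - 1.0)
    weights = [(1.0 / (d ** exponent + eps), label) for d, label in neighbors]
    total = sum(w for w, _ in weights)

    # Normalized weights accumulate into one membership degree per class.
    memberships = {label: 0.0 for label in set(train_y)}
    for w, label in weights:
        memberships[label] += w / total
    return memberships


def fuzzy_knn_predict(train_x, train_y, query, k=3, m=2.0):
    """Assign the class with the largest membership degree."""
    memberships = fuzzy_knn_memberships(train_x, train_y, query, k=k, m=m)
    return max(memberships, key=memberships.get)


# Toy two-dimensional feature vectors (illustrative values only).
train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
train_y = ["cytoplasm", "cytoplasm", "periplasm", "periplasm"]
query = (0.2, 0.3)

memberships = fuzzy_knn_memberships(train_x, train_y, query, k=3)
predicted = fuzzy_knn_predict(train_x, train_y, query, k=3)
print(memberships, predicted)
```

Unlike plain majority voting, the memberships sum to one and show how strongly the neighbors support each class, so a distant neighbor from a large class counts for less than a close neighbor from a small one.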
Five EFKNN-based base learners are trained on five independent datasets obtained from the five different feature extraction methods, and their outputs are combined by accumulative vote quantity to give the prediction for a test protein. Under the jackknife test, the accuracy on single-location proteins is 70.0%. The results on the two datasets indicate that the proposed model is a promising prediction tool for protein subcellular location, or may at least complement established methods and models. The AVQ method improves prediction accuracy and enriches both the theoretical research and the practical application of ensemble learning.
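The abstract does not spell out how AVQ combines the base learner outputs. The sketch below is a minimal illustration only. It assumes the combination amounts to accumulating per-class votes (or membership scores) across the base learners and choosing the class with the largest total. The function name `accumulate_votes`, the class labels, and the scores are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def accumulate_votes(base_outputs):
    """Sum per-class scores over all base learners and pick the top class.

    base_outputs: list of dicts, one per base learner, mapping
    class label -> vote count or membership score.
    """
    totals = defaultdict(float)
    for output in base_outputs:
        for label, score in output.items():
            totals[label] += score
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


# Five hypothetical base learners, e.g. one per feature extraction method.
outputs = [
    {"nucleolus": 0.60, "chromatin": 0.40},
    {"nucleolus": 0.30, "chromatin": 0.70},
    {"nucleolus": 0.80, "chromatin": 0.20},
    {"nucleolus": 0.55, "chromatin": 0.45},
    {"nucleolus": 0.40, "chromatin": 0.60},
]

label, totals = accumulate_votes(outputs)
print(label, totals)
```

Summing scores rather than counting crisp votes lets a confident base learner outweigh a hesitant one. Whether the thesis weights learners this way is not stated in the abstract.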
Keywords/Search Tags: protein subcellular location, ensemble learning, evolutionary fuzzy KNN, protein feature extraction, accumulative vote quantity