
Research On Protein Subcellular Localization Prediction

Posted on: 2018-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: N Zhao
GTID: 2370330575967105
Subject: Computer Science and Technology
Abstract/Summary:
The study of protein subcellular localization plays an important role in understanding cellular life activities, inferring the functions of unknown proteins, diagnosing diseases, and developing new drugs. With the rise of bioinformatics, large amounts of protein sequence data have been generated, and many methods for predicting protein subcellular localization have been proposed. In this study, the bag-of-words (BOW) model is introduced to improve traditional feature extraction algorithms, and a support vector machine (SVM) classifier is used to predict protein subcellular location. Good experimental results are obtained. The main contributions of this thesis are as follows.

Firstly, we choose two apoptosis protein data sets, ZD98 and CH317, as well as the Gram-negative bacteria data set Gram796, which is constructed using a standard data set construction method. We use the classical SMOTE oversampling algorithm to restructure the data sets.

Secondly, to improve the recognition accuracy of traditional protein sequence features, we propose a BOW feature extraction algorithm. The algorithm combines the BOW model with feature extraction algorithms such as amino acid composition and pseudo amino acid composition, and uses extensive numerical computation to extract the amino acid composition information and position information of protein sequences as comprehensively as possible. It consists of four stages: protein sequence segmentation, feature extraction of sequence words, dictionary construction by k-means, and statistical calculation. The algorithm effectively transforms the amino acid sequences of proteins into feature vectors, which provide good samples for the subsequent classification and prediction experiments.

Thirdly, to improve experimental efficiency and form a highly scalable computing platform that makes large-scale biological data classification possible, we build a Hadoop cluster and use the MapReduce programming framework to parallelize BOW feature extraction.

Fourthly, to carry out effective localization prediction experiments, we construct an SVM multi-class classifier to predict protein subcellular location, and we use a genetic algorithm and grid search to optimize the model parameters, which improves the performance of the SVM. To carry out feature extraction and classification on different data sets simultaneously, we use the MATLAB Parallel Computing Toolbox (PCT) to run the location prediction tasks in parallel on multiple cores, which improves the overall experimental efficiency.

Fifthly, to test the algorithm's performance, we perform objective and efficient jackknife tests on the data sets ZD98, CH317, and Gram796, and we evaluate the algorithm with three indicators, sensitivity (Sn), specificity (Sp), and correlation coefficient (MMCi), together with the total accuracy rate (A). The prediction success rates on ZD98, CH317, and Gram796 are 94.3%, 93.8%, and 93.7%, respectively. The values of Sn, Sp, and MMCi improve to varying degrees. The experimental results show that extracting BOW features from protein sequences and feeding them into an SVM classifier is an effective method for protein subcellular localization prediction.

Finally, we combine particle swarm optimization (PSO) and the bacterial foraging algorithm (BFA) to optimize the BOW feature extraction algorithm. Its parameter search space consists mainly of the segment length used in protein sequence segmentation (d) and the dictionary size (k). The BOW feature extraction algorithm optimized by PSO_BFA can find, in a short time, one or more parameter sets (d, k) whose corresponding BOW features achieve high recognition accuracy. The success rates on the ZD98, CH317, and Gram796 data sets are 95.9%, 95.1%, and 94.1%, respectively.
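The four stages of the BOW feature extraction described in the abstract (segmentation, word features, a k-means dictionary, and statistical counting) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names, the overlapping sliding window, the use of plain amino acid composition as the only word feature, the tiny pure-Python k-means, the toy sequences, and the values d = 5 and k = 3 are all assumptions made for illustration.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def segment(sequence, d):
    """Stage 1: cut the sequence into overlapping words of length d (assumed scheme)."""
    return [sequence[i:i + d] for i in range(len(sequence) - d + 1)]

def composition(word):
    """Stage 2: amino acid composition of one word (20 relative frequencies)."""
    counts = Counter(word)
    return [counts[a] / len(word) for a in AMINO_ACIDS]

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda j: sq_dist(vec, centroids[j]))

def kmeans(points, k, iterations=20, seed=0):
    """Stage 3: build a dictionary of k centroids with a basic k-means loop."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def bow_vector(sequence, centroids, d):
    """Stage 4: normalized histogram of dictionary entries hit by the sequence's words."""
    words = segment(sequence, d)
    hist = [0.0] * len(centroids)
    if not words:  # sequence shorter than d: no words to count
        return hist
    for w in words:
        hist[nearest(composition(w), centroids)] += 1
    return [h / len(words) for h in hist]

# Toy usage with made-up sequences; d and k are the two parameters named in the abstract.
sequences = ["MKTAYIAKQRQISFVKSHFSRQ", "MADEEKLPPGWEKRMSRSSGRV"]
d, k = 5, 3
all_words = [composition(w) for s in sequences for w in segment(s, d)]
dictionary = kmeans(all_words, k)
features = [bow_vector(s, dictionary, d) for s in sequences]
```

Each sequence, whatever its length, ends up as a fixed-length vector of k values, which is what allows the vectors to be passed to a classifier such as the SVM mentioned in the abstract. The pair (d, k) is also the search space that the abstract says is tuned by PSO_BFA.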
Keywords/Search Tags:subcellular locations, protein sequence characteristics, bag of words model, particle swarm optimization, bacterial foraging, support vector machine