Research On Protein Subcellular Localization Prediction

Posted on:2018-08-31

Degree:Master

Type:Thesis

Country:China

Candidate:N Zhao

Full Text:PDF

GTID:2370330575967105

Subject:Computer Science and Technology

Abstract/Summary:

Protein subcellular localization plays an important role in understanding cellular activity, inferring the functions of unknown proteins, diagnosing disease, and developing new drugs. With the rise of bioinformatics, large volumes of protein sequence data have been generated, and many methods for predicting protein subcellular localization have been proposed. In this study, the bag-of-words (BOW) model is introduced to improve traditional feature extraction algorithms, and a support vector machine (SVM) classifier is used to predict protein subcellular location. Good experimental results are obtained. The main contributions of this thesis are as follows.

First, we select two apoptosis protein data sets, ZD98 and CH317, together with the Gram-negative bacteria data set Gram796, which is constructed using a standard data set construction method. We use the classical SMOTE algorithm to restructure the data sets.

Second, to improve the recognition accuracy of traditional protein sequence features, we propose a BOW feature extraction algorithm. The algorithm combines the BOW model with feature extraction algorithms such as amino acid composition and pseudo amino acid composition, and relies on extensive scientific computation to extract the amino acid composition information and position information of protein sequences as comprehensively as possible. It consists of four stages: protein sequence segmentation, feature extraction of sequence words, dictionary construction by k-means, and statistical calculation. The algorithm effectively transforms the amino acid sequence of a protein into a feature vector, which provides good samples for the subsequent classification and prediction experiments.

Third, to improve experimental efficiency and form a highly scalable computing platform that makes large-scale biological data classification possible, we build a Hadoop cluster and use the MapReduce programming framework to parallelize BOW feature extraction.

Fourth, to carry out effective localization prediction experiments, we construct a multi-class SVM classifier to predict protein subcellular location, and we use a genetic algorithm and grid search to optimize the model parameters, which improves SVM performance. To run feature extraction and classification prediction on different data sets simultaneously, we use the MATLAB Parallel Computing Toolbox (PCT) to implement multi-core parallel computation of the location prediction tasks, which improves overall experimental efficiency.

Fifth, to test the performance of the algorithm, we perform jackknife tests on the data sets ZD98, CH317 and Gram796, and we evaluate the algorithm with three indicators, sensitivity (Sn), specificity (Sp) and correlation coefficient (MMCi), together with the total accuracy rate (A). The prediction success rates on ZD98, CH317 and Gram796 are 94.3%, 93.8% and 93.7%, respectively. The values of Sn, Sp and MMCi improve to varying degrees. The experimental results show that extracting BOW features from protein sequences and feeding them into an SVM classifier is an effective method for protein subcellular localization prediction.

Finally, we combine particle swarm optimization (PSO) and the bacterial foraging algorithm (BFA) to optimize the BOW feature extraction algorithm. Its parameter search space consists mainly of the protein sequence segment length (d) and the dictionary size (k). The BOW feature extraction algorithm optimized by PSO_BFA can find, in a short time, one or more parameter sets (d, k) whose BOW features give high recognition accuracy. The success rates on the ZD98, CH317 and Gram796 data sets are 95.9%, 95.1% and 94.1%, respectively.
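
The four stages of BOW feature extraction named in the abstract (segmentation, word features, k-means dictionary, statistical calculation) can be illustrated with a minimal Python sketch. This is not the thesis implementation: it assumes non-overlapping segments of length d, uses plain amino acid composition as the only word feature (the thesis also uses pseudo amino acid composition), and runs a basic Lloyd-style k-means. All function names and the toy sequences are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def split_into_words(sequence, d):
    """Stage 1: cut a sequence into non-overlapping words of length d.
    A trailing fragment shorter than d is dropped (an assumption)."""
    return [sequence[i:i + d] for i in range(0, len(sequence) - d + 1, d)]


def composition(word):
    """Stage 2: 20-dimensional amino acid composition of one word."""
    return [word.count(a) / len(word) for a in AMINO_ACIDS]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, iterations=20, seed=0):
    """Stage 3: basic k-means; the k centroids act as the dictionary."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids


def bow_vector(sequence, d, centroids):
    """Stage 4: normalized histogram of dictionary entries per sequence."""
    words = split_into_words(sequence, d)
    counts = [0] * len(centroids)
    for w in words:
        counts[nearest(composition(w), centroids)] += 1
    if not words:  # sequence shorter than d: no words to count
        return [0.0] * len(centroids)
    return [c / len(words) for c in counts]


def extract_bow_features(sequences, d, k):
    """Run all four stages and return one k-dimensional vector per sequence."""
    word_features = [composition(w)
                     for s in sequences
                     for w in split_into_words(s, d)]
    centroids = kmeans(word_features, k)
    return [bow_vector(s, d, centroids) for s in sequences]


if __name__ == "__main__":
    # Toy sequences for illustration only; not taken from ZD98, CH317 or Gram796.
    toy = ["MKTAYIAKQRQISFVKSHFSRQ", "MALWMRLLPLLALLALWGPD", "MKKLLPTAAAGLLLLAAQPAMA"]
    for vector in extract_bow_features(toy, d=4, k=3):
        print(vector)
```

Each sequence, whatever its length, is mapped to a vector with k entries that sum to 1, which is what allows variable-length proteins to be passed to a fixed-input classifier such as an SVM. The pair (d, k) used here corresponds to the two parameters that the abstract says are tuned by PSO_BFA.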

Keywords/Search Tags:

subcellular locations, protein sequence characteristics, bag of words model, particle swarm optimization, bacterial foraging, support vector machine

Related items

1	The Study Of Support Vector Machine Time Series Prediction That Optimized By Particle Swarm Optimization
2	Research On Prediction Of Protein Domains Based On Support Vector Machines
3	Research And Application Of Hybrid Time Series Model Based On Support Vector Machine
4	Research Of Protein Subcellular Location Using Machine Learning Algorithms
5	Eukaryotic Gene Promoter Recognition Based On Optimized Support Vector Machine
6	Protein Subcellular Localization Prediction Based The Fusion Characteristics
7	Inversion Of Rock And Soil Mechanics Parameters Based On PSO Optimization Wavelet Support Vector Machine
8	The Research On Protein Sequence Feature Extraction And Its Application On Protein Subcellular Location
9	The Research On Prediction Of Protein Subcellular Location Using Multi-information Fusion Based On Sequence
10	Support Vector Machine Approach For Protein Mesophilic & Thermophilic Recognition And Protein Subcellular Localization Prediction