
Research And Application Of Protein Sequence Classification Based On Multiple Kernel Learning

Posted on: 2020-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: Q Y Lian
Full Text: PDF
GTID: 2370330596475058
Subject: Computer Science and Technology
Abstract/Summary:
Predicting protein function has become a research hotspot in protein biology. Thermophilic proteins can be used as biocatalysts in extreme environments, where they accelerate chemical reactions, reduce industrial manufacturing costs, and reduce energy consumption. Effective prediction of thermophilic proteins is therefore an essential step in various manufacturing industries. With the advancement and implementation of the Human Genome Project, more and more protein sequences have been determined. Traditional methods of identifying protein function can no longer meet the demand because they are time-consuming and inefficient, so real-time and effective methods of predicting protein function are urgently needed. The growth of machine learning algorithms and of computing capability has facilitated data mining in biology. This paper mainly studies the application of multiple kernel learning to predicting the function of protein sequences. The specific research contents are as follows:

1) To describe proteins better, a new feature extraction method based on word2vec is proposed. The method treats a protein sequence as a text sentence composed of dipeptides and uses the word2vec algorithm to convert each dipeptide into a word vector. A vector representation of the whole sequence is then calculated from the dipeptides it contains. Experimental results show that this method improves the prediction accuracy of the model.

2) The first step of a multiple kernel learning method is to select the basic kernel functions, including their number, type, and internal parameters. The conventional way of selecting basic kernel functions is blind, time-consuming, and labor-intensive, so this paper proposes a kernel function selection method based on a greedy algorithm. The method observes that the feature vector is mainly assembled from different feature extraction methods, so the number of kernel functions is initialized to the number of feature extraction methods. Then, for the feature group corresponding to each feature extraction method, the greedy algorithm selects the optimal kernel function, which yields the final selection of basic kernel functions.

3) We propose a protein sequence classification model based on multiple kernel learning. Compared with other methods, multiple kernel learning methods are more flexible. We first use the greedy kernel function selection method to choose the basic kernel functions. The best combined kernel function is then learned through the simple multiple kernel learning algorithm. Finally, we train the classification model using the best combined kernel as the kernel function of the SVM algorithm. The experimental results show that the model identifies thermophilic proteins well. On the thermophilic protein sequence dataset used in this paper, 10-fold cross-validation gives an accuracy of 94.72%, a recall for thermophilic proteins of 94.84%, an MCC of 0.8939, and a ROC-AUC of 0.9859, which is superior to other machine learning methods and existing methods.

4) A web service for predicting thermophilic protein sequences has been developed so that other researchers can use the models presented in this paper.
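
The dipeptide-based representation described in item 1) can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not say whether dipeptides overlap or how the dipeptide vectors are combined, so the sliding window of two residues and the simple averaging below are assumptions. The function names `dipeptide_tokens` and `sequence_vector` and the embedding table `toy_embeddings` are hypothetical, and the table holds made-up values standing in for vectors that a word2vec model would learn.

```python
from typing import Dict, List


def dipeptide_tokens(sequence: str) -> List[str]:
    """Split a protein sequence into overlapping dipeptides.

    Assumption: a sliding window of width 2 with step 1, so the sequence
    reads like a sentence whose words are dipeptides.
    """
    seq = sequence.strip().upper()
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def sequence_vector(sequence: str,
                    embeddings: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Average the vectors of all dipeptides that have an embedding.

    Assumption: the sequence vector is the mean of its dipeptide vectors.
    Dipeptides missing from the table are skipped.
    """
    total = [0.0] * dim
    count = 0
    for token in dipeptide_tokens(sequence):
        vec = embeddings.get(token)
        if vec is None:
            continue  # no learned vector for this dipeptide
        for k in range(dim):
            total[k] += vec[k]
        count += 1
    if count == 0:
        return total  # all-zero vector when nothing matched
    return [x / count for x in total]


# Hypothetical 2-dimensional embedding table (made-up values, not trained).
toy_embeddings = {"MK": [1.0, 0.0], "KV": [0.0, 1.0], "VL": [1.0, 1.0]}

tokens = dipeptide_tokens("MKVL")
vector = sequence_vector("MKVL", toy_embeddings, dim=2)
print(tokens)
print(vector)
```

In practice the embedding table would come from a word2vec model trained on the dipeptide "sentences" of many protein sequences, and the resulting sequence vectors would form one feature group among those passed to the kernel selection step.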
Keywords/Search Tags: thermophilic protein sequence classification, multiple kernel learning, support vector machines, feature extraction