Research Of Protein Secondary Structure Prediction Based On Ensemble Learning

Posted on: 2021-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: H L Liang
GTID: 2370330611465679
Subject: Software engineering
Abstract/Summary:
The study of protein structure and function is one of the most important topics in modern bioinformatics and computational biology. Data mining and machine learning methods are often used to perform prediction or pattern recognition tasks and to support experimental analysis. In recent years, deep learning has been widely used in sequence analysis, but it still suffers from long training times and poor parallelism. Ensemble learning algorithms can save training time by running in a highly parallel manner, and they can quickly improve the overall prediction accuracy of simple models. However, combining ensemble learning methods with neural networks has seldom been studied. To this end, this thesis studies 8-state classification in protein secondary structure prediction, based on the ensemble methods Bagging, Boosting and Stacking together with convolutional neural networks (CNNs). The main contributions of the thesis are as follows:

(1) A hybrid model based on Bagging and CNN is proposed. The model replaces traditional simple classifiers such as SVM with deep CNNs, trains multiple deep CNNs in parallel, and integrates their predictions by relative majority voting, which effectively improves prediction accuracy. Further, a new classifier coefficient calculation method and a feature selection method are proposed to improve the overall prediction ability of the model. The experimental results show that the Bagging model using CNNs as homogeneous weak classifiers raises secondary structure prediction accuracy from 66% for a single CNN to 73%.

(2) A hybrid model based on Boosting and CNN is proposed, which uses Adaboost as an instance of Boosting. The model treats multiple CNNs as homogeneous weak classifiers and uses the SAMME method for optimization. Furthermore, a hybrid model that combines multiple Adaboost strong classifiers with the Bagging method is proposed. Experiments show that the algorithm achieves a training accuracy of 97.00% and a prediction accuracy of up to 77%. An accuracy of 74.29% is achieved on the public data set CB513, exceeding the 70.3% reported by state-of-the-art research.

(3) A hybrid model based on Stacking and CNN is proposed. The algorithm divides the data set by K-fold cross-validation, and its training process combines the characteristics of Bagging and Boosting. It can also stack multiple layers of heterogeneous weak classifiers to improve the feature extraction ability of the model. Further, a partitioning method that divides the original data set according to protein sequence length is proposed and combined with the original hybrid model. Experiments show that the algorithm can further improve the prediction accuracy of heterogeneous weak classifiers. Using the sequence length division method combined with the Adaboost model, an accuracy of 76.71% is achieved on the public data set Cull PDB, exceeding the highest previously reported accuracy of 74.0%.
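The voting steps named in the abstract can be illustrated with a minimal Python sketch. It assumes each classifier (for example a CNN) has already produced one 8-state label per residue; the function names, the tie-breaking rule, and the sample labels are illustrative assumptions, not the thesis's implementation. `samme_alpha` uses the standard SAMME classifier weight, ln((1 - err)/err) + ln(K - 1), and does not reproduce the new coefficient calculation method proposed in the thesis, which is not detailed here.

```python
import math
from collections import Counter

NUM_CLASSES = 8  # 8-state secondary structure classification


def _check_shapes(predictions):
    """Ensure there is at least one classifier and all outputs align."""
    if not predictions:
        raise ValueError("need at least one classifier's predictions")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all prediction sequences must have the same length")


def plurality_vote(predictions):
    """Relative majority voting over per-residue labels (Bagging-style).

    predictions: one label sequence per classifier, all of equal length.
    Ties go to the tied label from the earliest classifier (an assumption).
    """
    _check_shapes(predictions)
    voted = []
    for labels_at_position in zip(*predictions):
        counts = Counter(labels_at_position)
        top = max(counts.values())
        voted.append(next(l for l in labels_at_position if counts[l] == top))
    return voted


def samme_alpha(weighted_error, num_classes=NUM_CLASSES):
    """Standard SAMME classifier weight for a multi-class weak learner.

    The weight is positive only when the weak learner beats random
    guessing, i.e. weighted_error < 1 - 1/num_classes.
    """
    if not 0.0 < weighted_error < 1.0:
        raise ValueError("weighted_error must lie strictly between 0 and 1")
    return math.log((1.0 - weighted_error) / weighted_error) + math.log(num_classes - 1)


def weighted_vote(predictions, alphas):
    """Weighted voting (Boosting-style): each classifier votes with weight alpha."""
    _check_shapes(predictions)
    if len(alphas) != len(predictions):
        raise ValueError("need exactly one alpha per classifier")
    voted = []
    for labels_at_position in zip(*predictions):
        scores = {}
        for label, alpha in zip(labels_at_position, alphas):
            scores[label] = scores.get(label, 0.0) + alpha
        voted.append(max(scores, key=scores.get))
    return voted


if __name__ == "__main__":
    # Hypothetical per-residue outputs of three classifiers on a 4-residue chain.
    outputs = [list("HHEC"), list("HEEC"), list("GEEC")]
    print(plurality_vote(outputs))
    alphas = [samme_alpha(e) for e in (0.30, 0.35, 0.40)]
    print(weighted_vote(outputs, alphas))
```

With eight classes, a weak learner only needs a weighted error below 87.5% to receive a positive SAMME weight, which is why comparatively weak CNNs can still contribute to the boosted ensemble.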
Keywords/Search Tags:protein secondary structure, convolutional neural network, Bagging, Adaboost, Stacking