Research On The Method Of Ensemble Learning Based Protein Sequence Classification

Posted on: 2019-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhao
GTID: 2310330563953929
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology, computer science has gradually penetrated many areas of biological research, giving rise to a new discipline, bioinformatics. Proteins are essential components of organisms, so studying them leads to a better understanding of pathological mechanisms and supports drug design. Bioinformatics is now in the post-genomic era, and a large number of proteins from a variety of organisms have been sequenced. However, researchers cannot study every protein experimentally, so sequences cannot be converted into the corresponding scientific knowledge in time. Computational methods based on machine learning and mathematical statistics have therefore been proposed to predict protein function, allowing researchers to study proteins by developing stable and efficient algorithms. This thesis presents further research on the classification of protein sequences based on ensemble learning. The main contributions are as follows.

1) To extract information from protein sequences more effectively, a new sequence-based representation, the g-gap tripeptide composition, is used to formulate protein sequences. In addition, a feature discretization method based on the idea of functional domains is proposed. In k-fold cross-validation, both methods produced good classification results on the phage virus protein dataset. Furthermore, combining dipeptide features with different intervals makes the information carried by the features complementary, and this also achieved good classification results.

2) The ensemble methods used in bioinformatics mostly fit several kinds of models to a single feature set and combine them by voting. To make full use of different algorithms that learn from the data from different angles, so that the models fully complement one another, we propose an ensemble method that constructs multiple base classifiers over a multi-feature space and integrates their outputs with logistic regression or decision trees.

3) A new ensemble method based on logical operations is proposed. It uses only four logical operations: AND, NAND, OR and NOR. It also avoids the requirement of traditional ensemble methods that the base classifiers differ from one another, which means that even with similar base classifiers it can achieve a better ensemble result. Its effectiveness has been verified on the phage virion dataset.
Keywords/Search Tags:protein sequence classification, ensemble learning, feature extraction