
Prediction Of Plant Protein Function Based On Multi-Label Classification Algorithms

Posted on: 2020-06-07
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhang
GTID: 2370330596992298
Subject: Software engineering
Abstract/Summary:
In the era of big data, extracting valuable information has become a research hotspot. This study aims to discover the functions of uncharacterized proteins. Because a single protein can have several functions, the thesis uses multi-label classification algorithms, which can handle multiple functional classes at the same time. To build a dataset of known proteins, 66,341 protein sequences covering 43 GO functional classes were crawled from the GO and UniProt databases. The thesis studies in depth three problem-transformation algorithms (BR, CC, and RAkEL) and three algorithm-adaptation algorithms (BR-KNN, ML-KNN, and BP-MLL). Feature extraction methods based on physicochemical properties, n-grams, and k-skip-n-grams are studied, and feature fusion is applied. Experiments show that fusing n-gram and physicochemical features raises average precision by 0.008 over n-gram features alone. The physicochemical feature extraction method is improved so that the resulting feature dataset meets the requirements of multi-label classification. Six feature processing methods and six multi-label classification algorithms are combined to solve the multi-label classification problem quickly and effectively, and the dataset is evaluated on all 36 combinations. Experiments show that ML-KNN with 188-dimensional features performs best on Hamming loss, one-error, coverage, ranking loss, and average precision, outperforming the second-best combination by 0.003, 0.0215, 0.2383, 0.0043, and 0.0176, respectively. Its training time was only 2.9 minutes behind the fastest combination, ML-KNN with 20-dimensional features. The 188-dimensional features and ML-KNN were combined into a new integrated algorithm, which was packaged as MultiLabel.jar and applied to predict the functions of 4,423 unknown proteins in Dishaogua. The predicted functions include DNA-binding transcription factor activity, RNA polymerase II-specific; endonuclease activity; and others.
Keywords/Search Tags: data mining, protein GO function prediction, multi-label classification, feature extraction
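
The abstract names n-gram and k-skip-n-gram feature extraction and Hamming loss without defining them. The following is a minimal Python sketch of how such features and that metric could be computed. It is not the thesis's implementation: the function names, the 20-letter amino-acid alphabet, the frequency normalization, and the exact skip definition (one uniform gap of at most k residues between adjacent positions of a gram) are all assumptions made for illustration.

```python
from itertools import product

# Assumption: features are built over the 20 standard amino-acid letters.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def skip_gram_features(sequence, n=2, k=0):
    """Return normalized k-skip-n-gram frequencies for one protein sequence.

    With k=0 this reduces to plain contiguous n-gram frequencies.
    The output has 20**n entries, ordered like itertools.product.
    """
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = dict.fromkeys(vocab, 0)
    total = 0
    for gap in range(k + 1):
        step = gap + 1              # distance between adjacent residues
        span = (n - 1) * step       # distance from first to last residue
        for start in range(len(sequence) - span):
            gram = sequence[start:start + span + 1:step]
            if gram in counts:      # ignore grams with non-standard letters
                counts[gram] += 1
                total += 1
    return [counts[g] / total if total else 0.0 for g in vocab]


def hamming_loss(true_labels, predicted_labels):
    """Fraction of (sample, label) pairs that are predicted incorrectly.

    Both arguments are lists of equal-length 0/1 label vectors.
    """
    wrong = 0
    total = 0
    for truth, pred in zip(true_labels, predicted_labels):
        wrong += sum(t != p for t, p in zip(truth, pred))
        total += len(truth)
    return wrong / total


if __name__ == "__main__":
    features = skip_gram_features("ACAC", n=2, k=1)
    print(len(features))  # 400 features for n=2
    print(hamming_loss([[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [0, 1, 1]]))
```

Under these assumptions, each protein becomes a fixed-length vector regardless of sequence length, which is what a multi-label classifier such as ML-KNN needs as input. A lower Hamming loss is better, since it counts the share of individual label decisions that are wrong.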