
Research On Prediction Model Of Plant Moonlighting Protein Based On Machine Learning

Posted on: 2022-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Liu
Full Text: PDF
GTID: 2480306332970889
Subject: Computer application technology
Abstract/Summary:
With the development of the post-genome and big data era, the number of sequences in biological databases has increased rapidly. Using sequences to analyze the rules governing proteins and genes has gradually become a research hotspot in bioinformatics. A moonlighting protein is a protein that can perform two or more functions. Finding moonlighting proteins through biological experiments depends on chance discovery and is time-consuming and labor-intensive, whereas machine learning methods predict them more efficiently. Current moonlighting protein databases and prediction tools focus mainly on proteins from animals and microorganisms. Because the cells and proteins of animals and plants differ, existing tools may predict plant moonlighting proteins inaccurately. Hence, a benchmark data set and a prediction tool specific to plant moonlighting proteins are needed. The main tasks completed in this thesis are as follows:

(1) Research and integration of plant protein data. First, 600,000 proteins from 7 species were selected. Then, 306 negative samples were screened according to criteria such as GO annotations and semantic similarity, and sequence redundancy was reduced. Finally, a plant moonlighting protein benchmark data set with 138 positive samples and 245 negative samples was constructed.

(2) Feature extraction and preprocessing of plant proteins. To obtain features better suited to plant protein research, this study extracted 16 feature classes from protein sequences, then applied feature selection, feature normalization, and dimensionality reduction to the data set. Preliminary machine learning models were then used to select the feature class that performed best in plant moonlighting protein prediction: TPC (Tripeptide Composition).

(3) Construction of a prediction model based on machine learning. This research built models with five machine learning methods commonly used in bioinformatics, and optimized them with grid search and 5-fold cross-validation. The Support Vector Machine (SVM) performed best and was therefore chosen as the algorithm for the prediction tool, called IdentPMP (Identification of Plant Moonlighting Proteins). On the independent test set, the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUC) of IdentPMP are 0.43 and 0.66, which are 13.89% and 10.00% higher, respectively, than those of state-of-the-art non-plant-specific methods. This further demonstrates that a benchmark data set and a plant-specific prediction tool are required for plant moonlighting protein studies.

(4) Design and implementation of a plant moonlighting protein prediction system. To make IdentPMP more convenient to use, this study implemented the tool as a web application. Through the website, users can predict plant moonlighting proteins online and download the plant protein benchmark data set and software packages.

IdentPMP is the first attempt to build a moonlighting protein prediction tool specific to plants. We hope that IdentPMP will provide better services for research in plant science and proteomics.
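The abstract names Tripeptide Composition (TPC) as the best-performing feature class but does not describe how it is computed. The sketch below follows the commonly used definition, in which each of the 8,000 possible tripeptides over the 20 standard amino acids is represented by its relative frequency in the sequence. The function and constant names are hypothetical, and skipping windows that contain non-standard residues is an assumption made here, not a detail taken from the thesis.

```python
from itertools import product

# The 20 standard amino acids; 20^3 = 8000 possible tripeptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
INDEX = {tri: i for i, tri in enumerate(TRIPEPTIDES)}


def tripeptide_composition(sequence):
    """Return an 8000-dimensional vector of tripeptide frequencies.

    Assumption: windows containing a non-standard residue (e.g. 'X')
    are skipped, and counts are normalized by the number of windows
    actually counted.
    """
    seq = sequence.upper()
    counts = [0] * len(TRIPEPTIDES)
    total = 0
    # Slide a window of width 3 over the sequence.
    for i in range(len(seq) - 2):
        idx = INDEX.get(seq[i:i + 3])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        # Sequence too short, or no valid tripeptide found.
        return [0.0] * len(TRIPEPTIDES)
    return [c / total for c in counts]


# Example: a made-up 10-residue sequence yields 8 overlapping windows.
vec = tripeptide_composition("MKVLAAGMKV")
```

A vector of this kind would then be normalized and passed to a classifier such as the SVM mentioned above. The thesis's actual preprocessing and model settings are not given in the abstract.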
Keywords/Search Tags: Machine learning, Plant moonlighting protein, Prediction tool, Benchmark data set, Proteomics