
Prediction Of DNA-binding Proteins Based On Comprehensive Characteristics Of Protein Sequences And Ensemble Learning

Posted on: 2021-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: P C Chen
GTID: 2480306458491814
Subject: Applied Statistics
Abstract/Summary:
Protein-DNA interactions are closely related to the most basic life activities in organisms, such as gene transcription and regulation, DNA replication and repair, and chromatin and ribosome formation. The proteins involved are called DNA-binding proteins, and they play an important role in studying the internal mechanisms of disease and in designing drug targets. With the development of post-genomic technology and the adoption of high-throughput sequencing, identifying DNA-binding proteins by computational methods has become a new trend. We focus on feature engineering for DNA-binding protein classification, mainly feature extraction and feature selection, and on this basis establish an accurate and effective prediction model. The work consists of four parts: (1) Dataset construction: We search the Protein Data Bank (PDB) database with the keyword "DNA-binding protein" and extract the data; sequences shorter than 50 amino acids or containing "X" residues are removed. CD-HIT and BLASTCLUST are then used to reduce redundancy between sequences and to construct the training and test datasets. (2) Feature extraction: To fully exploit the underlying information in protein sequences, we combine five types of features based on amino acid composition, physicochemical properties and evolutionary information, generated by PseAAC, Local-DPP, sliding window, PSSM-DCT and PSSM400. (3) Feature selection: To increase the relevance and eliminate the redundancy of the extracted features, we apply the maximum relevance, minimum redundancy (mRMR), random forest and LASSO algorithms to select features. The 130-dimensional feature set selected by the random forest method gives the best performance. (4) Classification prediction: We construct the prediction model of DNA-binding proteins with the XGBoost method and 10-fold cross-validation on the training dataset, and verify its effectiveness on the test dataset. Accuracy, sensitivity, specificity, Matthews correlation coefficient and AUC are used to analyze the classification performance of the prediction model. Furthermore, we compare our method with the iDNA-Prot, DNA-Prot, iDNA-Prot|dis and DNA-binder methods. Our method achieves the best accuracy, specificity and Matthews correlation coefficient, exceeding the other methods by 0.1%-19.3%, 0.47%-19.5% and 0.03-0.401, respectively. The results demonstrate that our method is better than iDNA-Prot and DNA-Prot. In addition, the PSSM-based Local-DPP and PSSM400 features are the most important for recognizing DNA-binding proteins.
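For readers unfamiliar with the evaluation measures named in the abstract, the following is a minimal sketch of how accuracy, sensitivity, specificity and the Matthews correlation coefficient can be computed from confusion-matrix counts. It is an illustration under stated assumptions, not the thesis code: the function name `binary_metrics` and the example counts are hypothetical, and AUC is omitted because it requires predicted scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification measures from counts.

    tp/fn: DNA-binding proteins predicted as binding / non-binding.
    tn/fp: non-binding proteins predicted as non-binding / binding.
    (Hypothetical helper for illustration only.)
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Made-up counts, not results from the thesis.
example = binary_metrics(tp=40, fp=10, tn=40, fn=10)
```

Accuracy, sensitivity and specificity lie between 0 and 1, while the Matthews correlation coefficient ranges from -1 to 1. That difference in scale is why the abstract reports the first two improvements as percentages and the third as a plain difference.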
Keywords/Search Tags: DNA-binding proteins, feature extraction, feature selection, prediction model