
Research On DBP Identification And Classification Based On Machine Learning

Posted on: 2021-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: D X Huo
GTID: 2370330611452113
Subject: Engineering·Computer Technology
Abstract/Summary:
DNA-binding proteins (DBPs) participate in a variety of cellular activities and play a key role in the genetic evolution of living organisms. DBPs can be divided into single-stranded DNA-binding proteins (ssDBPs) and double-stranded DNA-binding proteins (dsDBPs), which play different roles in life activities: ssDBPs in DNA replication and recombination, and dsDBPs in DNA regulation and transcription. The study of DBPs underpins the exploration and explanation of how living organisms develop and evolve, and of diseases such as cancer. Identifying and classifying DBPs also helps reveal the relationship between protein structure and function.

DBPs can be identified by traditional biological experimental techniques such as filter-binding assays, X-ray diffraction crystallography, ChIP-chip, and NMR. However, these experiments require expensive equipment and are very time-consuming. Moreover, the explosive growth in newly discovered protein sequences makes large-scale identification and classification by traditional experiments difficult. With the progress of protein annotation and the development of machine learning algorithms, researchers in recent years have used supervised learning to identify DBPs quickly using only information extracted from protein sequences, which has greatly advanced research in this field.

First, a machine learning algorithm is used to construct a prediction model for identifying DBPs, and a model based on sequence information, named Multi-Feature Fusion-Selection (MFFS-IdentDBP), is proposed. The model uses 11 feature extraction methods to obtain a variety of effective feature information from protein sequences, then combines feature fusion with elastic net feature selection to obtain a feature vector representing each protein. The accuracy, MCC, and AUC of the model are 0.93, 0.86, and 0.97 on the test set and 0.83, 0.67, and 0.86 on the independent test set, which are superior to those of 14 existing DBP identification methods.

Second, the MFFS method is applied to DBP classification, and a classification model, MFFS-PreSDBP, is constructed to classify DBPs into ssDBPs and dsDBPs. In this paper, a new method of dividing the samples of the data set is used to address the overfitting caused by the imbalance between positive and negative samples in Uniprot1065. The model can accurately classify the positive and negative samples in the test set, and its accuracy, F1, and MCC on the independent test set reach 0.81, 0.88, and 0.44, which are higher than those of existing classification methods.

Both the DBP identification model and the DBP classification model presented in this paper perform well, indicating that the MFFS method can effectively capture the feature information of protein sequences and that the resulting features can be applied to further analysis and research on proteins.
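
The abstract reports accuracy, F1, and MCC side by side, and on the imbalanced classification task these can differ widely (0.81, 0.88, and 0.44). The following is a minimal sketch of how the three metrics are computed from a binary confusion matrix. The function names and the confusion counts are hypothetical, chosen only for illustration, and are not taken from the thesis. AUC is omitted because it requires ranked prediction scores rather than hard labels.

```python
import math


def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, tn, fn):
    """Harmonic mean of precision and recall for the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical, imbalanced confusion counts (NOT results from the thesis):
# 90 positive samples and 20 negative samples.
tp, fn = 80, 10   # positives predicted correctly / missed
fp, tn = 12, 8    # negatives predicted wrongly / correctly

print(f"accuracy = {accuracy(tp, fp, tn, fn):.2f}")
print(f"F1       = {f1_score(tp, fp, tn, fn):.2f}")
print(f"MCC      = {mcc(tp, fp, tn, fn):.2f}")
```

With these illustrative counts, accuracy and F1 stay high because the majority class dominates, while MCC is much lower because most negative samples are misclassified. This is why MCC is usually read alongside accuracy when the classes are imbalanced.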
Keywords/Search Tags: DBP identification and classification, machine learning, feature selection, BP neural network