The Prediction Of Protein Interactions Based On Integrated Learning Model

Posted on: 2020-11-23
Degree: Master
Type: Thesis
Country: China
Candidate: C Xu
Full Text: PDF
GTID: 2370330572996904
Subject: Applied Mathematics
Abstract/Summary:
In the post-genomic era, proteomics research is in full swing. The study of protein interactions not only helps to reveal the nature of life activities, but also advances the understanding of disease mechanisms and the development of effective drugs. The rapid development of machine learning brings new opportunities and challenges for understanding the mechanisms of protein interactions, and it plays an important role in proteomics research. In recent years, a growing number of computational methods for predicting protein interactions have been developed. In this thesis, the models are based on the idea of integrated (ensemble) learning and combine the Random Forest (RF) and Support Vector Machine (SVM) algorithms to predict protein interactions. The thesis covers the following two directions.

(1) Protein-protein interactions (PPIs) play a key role in various biological processes. Many methods have been developed to predict protein-protein interactions and protein interaction networks. However, many existing methods are limited because they rely on a large number of homologous proteins and interaction marks. In this thesis, we propose a novel integrated learning approach (RF-Ada-DF) with a sequence-based feature representation to identify protein-protein interactions. Our method first constructs a sequence-based feature vector to represent each pair of proteins, using Multivariate Mutual Information (MMI) and Normalized Moreau-Broto Autocorrelation (NMBAC). We then feed the 638-dimensional features into an integrated learning model that separates interacting pairs from non-interacting pairs. This model embeds Random Forest in the AdaBoost framework and turns weak classifiers into a single strong classifier. We also employ double fault detection to suppress overfitting during training. To evaluate the performance of the new method, we conduct several comprehensive tests of PPI prediction and compare the results with the best existing methods. On the H. pylori dataset, our method achieves 88.16% accuracy and 87.68% sensitivity, an accuracy gain of 0.57%. On the S. cerevisiae dataset, it achieves 95.77% accuracy and 93.36% sensitivity, an accuracy gain of 0.76%. On the Human dataset, it achieves 98.16% accuracy and 96.80% sensitivity, an accuracy gain of 0.6%. The experiments show that our method achieves better results than other leading methods for sequence-based PPI prediction.

(2) Ligand-receptor interactions (LRIs) play an important role in the signal transduction required for cellular differentiation, proliferation, and the immune response. Analyzing ligand-receptor interactions provides a deeper understanding of cellular proliferation, differentiation, and other cell processes, and computational techniques can promote ligand-receptor interaction research in future proteomics work. In this thesis, we propose a novel computational method that predicts ligand-receptor interactions from amino acid sequences with a machine learning approach. We extract features from ligand and receptor sequences using the Histogram of Oriented Gradient (HOG) and the Discrete Cosine Transform (DCT). We propose two models for the ligand-receptor dataset, which is unbalanced. The Neighborhood fuzzy model divides the dataset into several sub-datasets using Fuzzy C-means (FCM) clustering, trains several sub-classifiers with SVM, and then uses a similarity measure (distance measure) to select the optimal sub-classifier for training. The Ensemble fuzzy model uses FCM and bootstrap sampling to divide the dataset into several balanced sub-datasets, trains several sub-classifiers, and obtains the final result by voting among these sub-classifiers. To verify the performance of our models, we perform five-fold cross-validation experiments on a ligand-receptor interaction dataset and achieve 80.08% accuracy, 82.98% sensitivity, and 80.02% specificity. Compared with a single SVM classifier, whose sensitivity is 46.28%, the sensitivity of our model is higher by 36.7 percentage points. We then test our feature extraction method on two protein-protein interaction datasets and achieve accuracies of 93.79% and 87.46%, respectively. Our proposed method can be a useful tool for identifying ligand-receptor interactions.
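As an illustration of the sequence-based features named in part (1), the sketch below computes Normalized Moreau-Broto Autocorrelation descriptors for a protein pair. It uses the commonly given form AC(lag) = (1 / (n - lag)) * sum of P_i * P_(i+lag), where P_i is a z-score-normalized physicochemical value of residue i. The abstract does not state which properties or lags the thesis uses, so the single Kyte-Doolittle hydrophobicity scale, the lag range, and the function names `normalize_scale`, `nmbac`, and `pair_features` are all assumptions made for this example; it is not the author's implementation and does not reproduce the 638-dimensional vector.

```python
from statistics import mean, pstdev

# Assumption: one property only (Kyte-Doolittle hydrophobicity).
# The thesis abstract does not list the properties it actually uses.
HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def normalize_scale(scale):
    """Z-score a property scale over the 20 standard amino acids."""
    values = list(scale.values())
    mu, sigma = mean(values), pstdev(values)
    return {aa: (v - mu) / sigma for aa, v in scale.items()}


def nmbac(sequence, scale, max_lag):
    """Return autocorrelation values for lags 1..max_lag (hypothetical helper).

    Raises KeyError for residues missing from the scale and ValueError
    when the sequence is too short for the requested lag.
    """
    norm = normalize_scale(scale)
    values = [norm[aa] for aa in sequence]
    n = len(values)
    if max_lag >= n:
        raise ValueError("sequence must be longer than max_lag")
    features = []
    for lag in range(1, max_lag + 1):
        # Average product of property values that are `lag` residues apart.
        total = sum(values[i] * values[i + lag] for i in range(n - lag))
        features.append(total / (n - lag))
    return features


def pair_features(seq_a, seq_b, scale, max_lag):
    """Concatenate the descriptors of both proteins into one pair vector."""
    return nmbac(seq_a, scale, max_lag) + nmbac(seq_b, scale, max_lag)


if __name__ == "__main__":
    vector = pair_features("MKTAYIAKQR", "GSHMLEDPVA", HYDROPHOBICITY, max_lag=3)
    print(len(vector), vector)
```

With several property scales, the same routine would be run once per scale and the results concatenated, which is how a fixed-length vector can be obtained from sequences of different lengths.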
Keywords/Search Tags:Protein-protein interaction, Ligand-receptor interactions, Double fault detection, Feature extraction, Support vector machine, Bioinformatics