Research On Prediction Methods Of Protein-protein Interaction Sites

Posted on: 2019-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Wang
GTID: 2370330548478918
Subject: Control Engineering
Abstract/Summary:
In the post-genomic era, protein-protein interactions (PPIs) are a hot research topic in bioinformatics. Research on the interactions between proteins can help us understand life activities and solve problems in drug design and development. This thesis adopts computational methods, with particular attention to the treatment of imbalanced data sets, to predict interaction sites between proteins. The main work of this thesis is as follows.

First, we predict PPI sites from the evolutionary conservation of amino acids. Five features are extracted from the HSSP database and the ConSurf Server to describe this conservation: the spatial sequence of the residues, the information entropy of the residues, the relative entropy of the residues, the conservation weights of the residue sequences, and the evolution rate of the residues. A support vector machine classifier is constructed on these five features, and the experimental results show that they can effectively predict PPI sites.

Second, by analyzing the current definitions of PPI sites, we find an imbalance between positive and negative samples in the data sets. This thesis uses three methods to process the imbalanced data sets and further improve prediction accuracy. The first applies the edited nearest neighbor (ENN) rule to the negative samples. The second combines ENN with the boundary noise factor method, which takes the sample overlap problem into account. The third is an under-sampling method based on k-means clustering, which avoids deleting majority-class samples that carry important information as far as possible and makes the retained majority-class samples more representative.

The experimental results show that all three methods reduce the impact of sample imbalance on the prediction of PPI sites and effectively identify the interaction sites. With the third method, the accuracy reaches 75.8%, and the sensitivity and specificity are 67.6% and 84.2%, respectively, which are higher than those achieved by similar works. Our work has implications for solving the data imbalance problems that exist widely in bioinformatics.
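
The edited nearest neighbor step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's implementation, feature vectors, or parameter values, so everything here is an assumption made for illustration: the function name `edited_nearest_neighbor`, the choice `k=3`, Euclidean distance, and the toy two-dimensional data. The idea shown is only the general ENN rule applied to the negative (majority) class: a negative sample is dropped when most of its k nearest neighbors carry a different label.

```python
import math
from collections import Counter


def edited_nearest_neighbor(X, y, k=3, majority_label=0):
    """Hypothetical ENN sketch: remove majority-class samples whose label
    disagrees with the majority vote of their k nearest neighbors.

    X is a list of feature tuples, y a list of labels (assumed: 0 = non-interface
    residue, 1 = interface residue). Use an odd k to avoid tied votes.
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)  # minority (positive) samples are never removed
            continue
        # Euclidean distance from sample i to every other sample, nearest first
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbor_labels = [y[j] for _, j in dists[:k]]
        vote = Counter(neighbor_labels).most_common(1)[0][0]
        if vote == yi:
            keep.append(i)  # neighbors agree with the label: keep the sample
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): four negatives near the origin, three positives near (5, 5),
# and one negative sitting inside the positive cluster.
X_toy = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (5.1, 5.0)]
y_toy = [0, 0, 0, 0, 1, 1, 1, 0]

X_res, y_res = edited_nearest_neighbor(X_toy, y_toy, k=3)
print(len(X_toy), "->", len(X_res), "samples after ENN")
```

In this toy run only the negative sample located among the positives is removed, which is the kind of overlapping majority-class sample the editing step is meant to clean up before a classifier is trained.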
Keywords/Search Tags: Protein interaction sites, Conservation features, Support vector machines, Imbalanced data sets, Edited Nearest Neighbor, Boundary noise factors, Clustering