Prediction Methods Of Functional Sites In Protein

Posted on: 2019-04-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Li
Full Text: PDF
GTID: 1360330590466634
Subject: Computer Science and Technology

Abstract/Summary:
With the completion of the Human Genome Project (HGP), the life sciences have entered the post-genome era, and the proteins expressed by genes have become a research hotspot. Proteins carry out specific functions by interacting with their ligands, and these interactions play a decisive role in life processes. Identifying the functional sites at which proteins interact with their ligands is therefore important for an in-depth understanding of protein structure and biological function, and especially for the treatment of certain diseases and the development of new drugs. This dissertation studies the prediction of two kinds of protein functional sites: protein-protein interaction sites and zinc-binding sites. Because of the explosive growth of protein data in recent years, identifying protein functional sites with existing experimental and computational methods no longer meets the needs of biologists. How to design and develop reasonable, efficient, and accurate methods for predicting protein functional sites has become an important research topic in bioinformatics. To further improve the prediction accuracy for protein-protein interaction sites and zinc-binding sites, machine learning methods are used to integrate multiple prediction methods into new tools for predicting protein functional sites. Furthermore, to address data imbalance, sampling techniques for imbalanced data, feature selection strategies, and classification algorithms are studied using sampling and ensemble learning methods, and several new methods for predicting protein functional sites are proposed. The main research work is as follows:

1. Many tools for predicting protein-protein interaction sites have been proposed and have found some application, but few researchers have studied the problem from the perspective of data imbalance. In protein-protein interactions, actual binding sites make up only a very small proportion of the whole protein sequence, so the positive and negative samples are imbalanced. Traditional machine learning methods tend to bias their results toward the majority class, which hinders identification of the minority class, the protein-protein interaction binding sites. To address the imbalance of the protein interaction dataset, the SMOTE algorithm is used to synthesize minority-class samples: the k-nearest neighbor algorithm is used to linearly interpolate between minority samples, generating new samples that adjust the balance of the sample data. A new method, a radial basis function neural network based on SMOTE (Radial Basis Function Improved by SMOTE, RBFIS), is proposed. Leave-one-out cross-validation is used, and an appropriate oversampling rate is selected. Experiments show that increasing the proportion of the minority class improves the average performance indexes of the prediction and greatly improves prediction performance on the minority class. Combinations of different protein features were also tested, and combining multiple features helps improve prediction accuracy for the minority class.

2. At present, tools for predicting zinc-binding sites in proteins mainly adopt a single machine learning algorithm or integrate a few classical algorithms; few researchers have integrated existing prediction tools. Considering the availability of protein sequence information, a linear regression method was used to integrate three classical prediction tools, Zinc Explorer, zinc Finder, and zinc Pred, and a new predictor, meta-zinc Prediction, was proposed. The method integrates the numerical prediction results of the three tools and adjusts the parameters until they are optimal. Tested on the non-redundant Zhao_dataset, meta-zinc Prediction greatly improves overall prediction performance on the four types of binding-site residues. Its performance on each of the four residue types individually is also better than that of the other predictors. To further demonstrate the robustness and accuracy of the integrated predictor, it was tested on a non-redundant independent test dataset (Collected Dataset). The prediction ability of meta-zinc Prediction was better than that of the other three predictors, whether the zinc-binding sites included all four residue types or a single residue type. To make the prediction method easier to use, the authors developed a software tool for it.

3. The Bayesian method is a statistical method based on uncertainty theory that can deal effectively with incomplete or missing data. In this dissertation, the Bayesian method is used to integrate the prediction results of three different prediction tools, and both positive and negative sample information is incorporated into the model. Even when a data value is missing, the missing value is filled in and the prediction results do not deviate significantly. No cutoff threshold needs to be set for classification: the probability that a sample belongs to each class is calculated, and the sample is assigned to the class with the highest probability. The prediction scores of three sequence-based prediction methods, Zinc Explorer, zinc Finder, and zinc Pred, were used as attributes, and a new Bayesian predictor of zinc-binding sites, Bayes_Zinc, was proposed. Experiments showed that its average performance indexes (MCC, recall, and precision) are superior to those of other methods, and that it achieves good prediction performance over the whole [0,1] interval.

4. Research shows that the number of zinc-binding sites in proteins is very small relative to the number of non-binding sites, so predicting zinc-binding sites is a typical imbalanced two-class classification problem. To improve prediction accuracy on imbalanced data and avoid the bias of traditional machine learning methods when classifying imbalanced datasets, the following steps are taken. First, random sampling is applied to the majority-class samples to produce balanced datasets. Second, a support vector machine (SVM) base classifier is trained on each balanced dataset, sample weights are calculated, and a probabilistic neural network model based on the sample weights is established. Then, the results of the different classifiers are integrated. Finally, the SSPWNN model for predicting zinc-binding sites in proteins is proposed, based on the support vector machine and the sample-weighted probabilistic neural network. Tested on the training set, the proposed method performs better than its component predictors. Compared with four other methods, it is superior both in overall prediction performance on the four residue types and in prediction performance on each individual residue type. On the independent test set, both the overall prediction ability for the four residue types and the prediction ability for each individual residue type were tested, and the results improved on the whole. In addition, prediction experiments were carried out with individual features removed, the performance-index scores were calculated, and the importance of the feature attributes selected by this method was analyzed.
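
To make the oversampling step in the first contribution concrete, here is a minimal Python sketch of SMOTE-style interpolation between a minority sample and one of its k nearest minority-class neighbors. It is an illustration only, not the RBFIS implementation: the function name `smote`, its parameters (`n_new`, `k`, `seed`), and the toy data are assumptions, and the radial basis function network and the choice of oversampling rate are left out.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic vectors built from minority-class vectors.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Choose a minority sample at random.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by Euclidean distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        # Choose one of the k nearest neighbors.
        neighbor = minority[rng.choice(others[:k])]
        # Take a random point on the segment between x and that neighbor.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Each synthetic point lies on a line segment between two existing minority samples, so the new samples stay inside the region the minority class already occupies. In practice `n_new` would be set from the chosen oversampling rate.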
Keywords/Search Tags: functional sites, protein-protein interaction sites, zinc-binding sites, prediction, imbalance, integration, machine learning