Research On Prediction Of Protein-protein Interaction Sites And Subcellular Localization

Posted on: 2018-01-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G H Liu
Full Text: PDF
GTID: 1310330542955003
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Proteins are the material foundation of life and the means by which its activities are carried out. Studying protein structure, protein function, and protein-protein interactions is valuable both for understanding disease mechanisms and for designing novel drugs. Because traditional experimental methods cannot meet the demands of large-scale protein analysis in functional studies, bioinformatics methods based on machine learning have attracted increasing attention. Protein-protein interactions and protein subcellular localization are important research topics in proteomics and protein function. As protein data grow more complex, newly emerging problems call for new solutions, and a variety of new challenges place higher demands on prediction algorithms.

Against this background, and from the perspective of bioinformatics and machine learning, this dissertation studies in depth the sequence-based prediction of protein-protein interaction sites (PPIs) and the image-based prediction of protein subcellular localization. From the point of view of pattern recognition, predicting protein interaction sites is a binary classification problem, whereas predicting subcellular localization is a multi-class classification problem. Accordingly, the dissertation first examines different methods for handling the key problem of imbalance between the two classes in sequence-based prediction of protein interaction sites. Then, for image-based prediction of protein subcellular localization, it proposes a new feature extraction method and a new classifier algorithm to classify multiple subcellular localization types. The research thus progresses from binary classification to multi-class classification. The main work of this dissertation is as follows:

(1) A method combining data-cleaning and post-filtering procedures was proposed. The method targets class imbalance in sequence-based PPIs prediction. First, a random-forest-based data-cleaning procedure removes marginal targets from the majority samples; these targets potentially have a negative impact on training a model with a clear classification boundary, and removing them relieves the severity of class imbalance in the original training dataset. Then, a prediction model is trained on the cleaned dataset. Finally, a post-filtering procedure is applied to reduce potential false positives in the predictions. Stringent cross-validation and independent validation tests on benchmark datasets demonstrated the efficacy of the proposed method, which achieves performance very competitive with existing state-of-the-art sequence-based PPIs predictors and can complement existing PPIs prediction methods.

(2) A clustering down-sampling method based on the Doppler Effect Bat Algorithm (DEBA) was proposed. In sequence-based PPIs prediction, random down-sampling easily leads to the loss of important information. This method aims to preserve the sample distribution information by down-sampling through DEBA-based clustering. First, the minority class samples (interacting residues) and the majority class samples (non-interacting residues) in the training set are separated. Then, the DEBA algorithm is used to cluster the majority samples; several clusters are obtained, and their centers are calculated. Down-sampling is achieved by selecting the same number of majority samples from each of the different cluster centers. This cluster sampling method not only preserves the distribution information of the original samples but also reduces the degree of imbalance between the two classes. A detailed flowchart of the prediction model is also given. Extensive experiments on benchmark datasets demonstrated that the proposed method deals effectively with class imbalance.

(3) A method for predicting protein subcellular location based on multi-view image features was proposed. In this method, multi-view image features not introduced in existing methods are used to describe the protein and DNA in immunohistochemistry (IHC) images, and the original-image features and segmentation features are combined to improve the performance of subcellular localization prediction. Pure image features were extracted from three views: four texture features of the original image; the global and local features of the protein, extracted from the protein channel image after color segmentation; and the global features of DNA, extracted from the DNA channel image. The extracted features are then combined, and feature selection is performed using stepwise discriminant analysis (SDA). Stringent 10-fold cross-validation tests on the benchmark dataset are performed using different classifiers with the pure feature sets and the combined feature sets. From these successive tests, the best combined features and the best classifier can be obtained. A classifier based on a Stacked Auto Encoder (SAE) and random forest is also proposed, in which the multi-level network is combined with the traditional statistical classification method to improve the prediction results. Stringent cross-validation and independent validation tests on benchmark datasets demonstrated the efficacy of the proposed method, which achieves performance very competitive with existing state-of-the-art image-based predictors of protein subcellular location...
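
The cluster-based down-sampling in contribution (2) can be illustrated with a minimal sketch. The abstract gives no details of DEBA, so plain k-means is swapped in here purely as a stand-in clustering step; it is not the dissertation's algorithm. The function names (`kmeans`, `cluster_downsample`), the toy 2-D data, and the choice to keep the samples nearest each cluster center are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def assign(points, centers):
    """Group every point with its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; an ASSUMED stand-in for DEBA clustering."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        centers = [mean_point(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, assign(points, centers)


def cluster_downsample(majority, n_keep, k=3, seed=0):
    """Keep up to n_keep // k majority samples per cluster.

    Taking the samples closest to each center is an assumption;
    a cluster smaller than the quota contributes all its members.
    """
    centers, clusters = kmeans(majority, k, seed=seed)
    per_cluster = n_keep // k
    kept = []
    for center, members in zip(centers, clusters):
        ranked = sorted(members, key=lambda p: squared_distance(p, center))
        kept.extend(ranked[:per_cluster])
    return kept


# Toy data: 60 majority (non-interacting) and 6 minority (interacting) samples.
majority = [(x + 0.1 * i, y + 0.05 * i)
            for (x, y) in [(0, 0), (10, 10), (20, 0)]
            for i in range(20)]
minority = [(5.0, 5.0), (5.5, 5.2), (6.0, 4.8),
            (15.0, 5.0), (15.5, 5.2), (16.0, 4.8)]

kept = cluster_downsample(majority, n_keep=len(minority), k=3)
balanced_training_set = [(p, 0) for p in kept] + [(p, 1) for p in minority]
```

Drawing the same quota from every cluster, rather than sampling the majority class uniformly at random, is what lets the reduced set still cover each region of the original majority distribution.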
Keywords/Search Tags:sequence-based prediction, protein-protein interaction sites, class imbalance, image-based prediction, protein subcellular location, image features, classifier