Intelligent Computing Based Analysis And Prediction Of The Solvent Accessibility And Function Of Protein Residues

Posted on:2018-06-16

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J Zha

GTID:1310330515471667

Subject:Intelligent Environment Analysis and Planning

Abstract/Summary:

Protein structure determines protein function. The analysis and prediction of protein structure is a fundamental and important part of protein research. The solvent accessibility of protein residues is basic structural information. It plays a crucial role in analyzing protein spatial conformation, modelling protein 3D structure, predicting the interactions between proteins and other molecules, and studying the metastasis and evolution of the proteins themselves. Interaction with other molecules is one of the most prevalent ways in which proteins express their function, so the analysis and prediction of protein functional residues contributes to research on protein function.

The traditional methods used to extract protein structure and function information are all biophysical or biochemical techniques. These techniques require expensive instruments, complex experimental procedures and extensive human resources. This work can benefit from the development of bioinformatics, which uses intelligent computing methods to accurately predict protein structure information and functional residues. In fact, fewer than 2‰ of proteins have accurate 3D structure information on certain segments. With the avalanche of hundreds of thousands of unknown proteins, intelligent computing based methods are becoming more and more popular, because they are efficient and accurate and can provide valuable clues for specific experiments.

This thesis mainly focuses on the analysis and prediction of basic protein structure and functional residues. The main content is as follows:

(1) This thesis proposes a method to predict the solvent accessibility of protein residues that uses a weighted sliding window and the particle swarm optimization algorithm. First, we extract five types of sequence-derived features to encode every residue in a protein together with its neighbors.
Then, to accurately quantify the influence of adjacent residues, we propose a weighted sliding window scheme that assigns weights to the different positions of the window. Next, the particle swarm optimization algorithm is used to search for the optimal parameters of support vector regression. Compared with previous studies, our method significantly improves the prediction results. We also compare different regression algorithms on the main datasets, evaluate the prediction results of various parameter optimization methods, and analyze the sources of regression error and the mean error level for each of the 20 amino acids. We further compare the proposed method with previous studies on benchmark datasets. The results show that our method produces more accurate solvent accessibility values. To test the generalization of the proposed method, we compile a new independent dataset and compare our method with state-of-the-art predictors. The prediction performance shows that the proposed method is effective and generalizes well.

(2) This thesis proposes a novel method to predict conformational B-cell antigen determinant residues and potential epitopes. The method is based on cost-sensitive ensemble learning and a spatial clustering algorithm. First, five sequence-based features are used to encode antigen residues: evolutionary conservation, secondary structure, protein disorder, dipeptide composition and physicochemical properties. To reduce computation time and remove redundant features, we use the Fisher-Markov selector to calculate the correlation between each feature and the labels, and then adopt an incremental feature selection strategy to search for the optimal feature subset based on the ranked features. Additionally, the prediction of conformational B-cell antigen determinant residues is a typical imbalanced data classification problem: the number of antigen determinant residues is much smaller than that of non-determinant residues.
Traditional machine learning methods perform poorly on imbalanced problems because they are designed and optimized for approximately balanced datasets. For this reason, we introduce a cost-sensitive ensemble learning method in this study. We observe that the majority of epitopes are spatially clustered. Therefore, we adopt a spatial clustering algorithm to predict potential epitopes from the predicted antigen determinant residues. Experiments on the benchmark datasets demonstrate the effectiveness of the proposed method compared with previous studies. The independent test indicates that the proposed method has excellent generalization.

(3) We propose a novel structure-based method to accurately identify heme binding residues using a fast-adaptive ensemble learning algorithm and a ligand-specific scheme. First, according to the characteristics of heme proteins, we calculate features of amino acid composition, motifs, surface preferences and secondary structures. Furthermore, we investigate the intrinsic attributes of heme binding residues and regions. We find that heme binding residues and regions are enriched in cysteine and histidine. Additionally, heme binding residues tend to be located in relatively small cavities on the heme protein surface. We also find that heme binding residues tend to cluster at the edges of secondary structure segments. The identification of heme binding residues is a typical imbalanced classification problem, because the number of heme binding residues is much smaller than that of non-heme binding residues. This thesis proposes a novel fast-adaptive ensemble learning algorithm, which dynamically monitors and adjusts the ratio of majority to minority samples.
In particular, a ligand-specific strategy is introduced to enhance the prediction performance for two prevalent heme ligands (namely HEC and HEM). The benchmark test and the independent test demonstrate the effectiveness of the proposed method. We also investigate the influence of the ratio of majority to minority samples. The newly proposed fast-adaptive ensemble learning algorithm performs well on generally imbalanced data. For severely imbalanced data, it only slightly improves the prediction results compared with other ensemble algorithms. The proposed method has been implemented as a publicly available web server, which will be convenient for biologists.
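
The weighted sliding window encoding described in part (1) can be illustrated with a minimal Python sketch. The abstract does not state how the position weights are obtained, so the Gaussian decay used here is an assumption, as are the names position_weights, encode_residues and sigma and the zero padding at the chain termini. It shows the general idea only, not the scheme used in the thesis.

```python
import math


def position_weights(window_size, sigma=2.0):
    """Hypothetical weights: 1.0 at the central residue, Gaussian decay with distance."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the window has a central residue")
    half = window_size // 2
    return [math.exp(-(offset ** 2) / (2.0 * sigma ** 2))
            for offset in range(-half, half + 1)]


def encode_residues(features, window_size=5, sigma=2.0):
    """Encode each residue as the concatenation of the weighted feature
    vectors of all residues inside its sliding window.

    features: list of per-residue feature vectors, all of the same length.
    Window positions that fall outside the chain are zero-padded (an assumption).
    """
    weights = position_weights(window_size, sigma)
    half = window_size // 2
    dim = len(features[0])
    encoded = []
    for i in range(len(features)):
        row = []
        for k, weight in enumerate(weights):
            j = i + k - half  # index of the neighbor at window position k
            if 0 <= j < len(features):
                row.extend(weight * x for x in features[j])
            else:
                row.extend([0.0] * dim)
        encoded.append(row)
    return encoded


# Toy input: three residues with one feature each, window of three positions.
toy_features = [[1.0], [2.0], [3.0]]
toy_encoded = encode_residues(toy_features, window_size=3, sigma=1.0)
```

Each row of toy_encoded could then be passed to a regressor such as support vector regression. In a setting like the one the abstract describes, the window weights and the regressor parameters would be tuned rather than fixed by hand.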

Keywords/Search Tags:

Intelligent computing, machine learning, particle swarm optimization algorithm, solvent accessibility, conformational B-cell epitope, heme protein

Related items

1	Research On Protein Complex Accurate Recognition Based On Machine Learning
2	Research On Particle Swarm Optimization Algorithms And Applications For Some Optimization Problems
3	Study On Precipitation Prediction Model Of Extreme Learning Machine Based On Intelligent Optimization Algorithm
4	Research On Multi-meteorological Phase State Inversion Method Based On Machine Learning
5	Study On Smart Prediction Method Of Slop Stability Based On Hybrid Kernel Extreme Learning Machine Trained And Optimized By Particle Swarm Optimization
6	Research On Protein Complexes Identification Algorithm Based On Intelligent Optimization Algorithm
7	The Study Of Protein Amino Acid Residues' Solvent Accessibility Prediction And Gene Expression Profile Analysis
8	Application Of Swarm Intelligent Optimization Based Machine Learning In Landslide Deformation Prediction
9	Prediction Of Protein Solvent Accessibility Based On All-atom Encoding
10	Phage Display Peptides Pretreatment And The B-cell Epitope Prediction Study