The Study Of Characterization And Prediction Of Binding Sites On Proteins Based On Machine Learning Methods

Posted on: 2012-04-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Xiong
GTID: 1118330368483856
Subject: Computer application technology
Abstract/Summary:
With the completion of genome sequencing projects for human and other species, the growing availability of genome sequence data provides the encoding information for hundreds of thousands of proteins. As the products of genetic information, proteins are the carriers of the most important biological activities and the executors of cellular functions. In biological cells, proteins perform specific functions when they interact with other molecules. However, only a subset of the residues on a protein participate directly in the interaction with other molecules. These interacting residues play crucial roles in various biological functions. Therefore, the characterization and identification of functional residues, or binding sites, provides important clues for exploring the function of proteins.
In the last decade, researchers have focused on developing computational methods to predict functional residues on proteins. In particular, machine learning-based methods have been applied to the prediction of binding residues from sequence-derived or structure-derived features. In this dissertation, we first exploit amino acid indices to analyze the physicochemical attributes specific to the different types of molecules (such as protein, DNA/RNA and heme) that bind to proteins, and we propose a new classification method to predict heme-binding residues from heme-binding protein sequences. More importantly, we explore and design effective structural and topological features to characterize and predict DNA-binding residues. The research topics are outlined as follows:
1. We exploit amino acid indices to analyze the physicochemical attributes specific to the different types of molecules (protein, DNA/RNA and heme) that bind to proteins, and propose a new sequence-based method to predict heme-binding residues. Our results show that the different types of binding residues each have their own relevant attributes.
We first propose an intuitive feature selection scheme and a novel integrative sequence profile, which is generated by coupling the position-specific scoring matrix (PSSM) with the selected physicochemical properties. Evaluation by 5-fold cross-validation on the training set and on an independent test set demonstrates that the proposed approach outperforms conventional methods based on PSSM profiles for the prediction of heme-binding residues.
2. Feature design and analysis of DNA-binding residues for the prediction models. In this section, we first build benchmark datasets consisting of DNA-binding protein structures in both their holo and apo forms. We then introduce novel features, such as temperature factor, packing density and betweenness centrality, to describe DNA-binding residues on bound and unbound structures. The statistical results derived from the new features can provide useful information to molecular biologists.
3. We propose a new method that uses a strategy based on dimensionality reduction to predict DNA-binding residues. In the previous section, the methods for predicting DNA-binding residues included data for neighboring residues by concatenating a number of properties, resulting in high-dimensional feature vectors. To overcome this limitation, we first introduce a novel weighting factor to quantify the distance-dependent contribution of each neighboring residue in determining the location of a binding residue. Then, a weighted average scheme (dimensionality reduction) is proposed to represent the surface patch of the residue under consideration. Based on these strategies, we exploit a reduced set of weighted average features to improve the prediction of DNA-binding residues from structures. Experimental results indicate that our approach predicts DNA-binding residues with high accuracy and high efficiency using a reduced set of weighted average features, and compares favorably to the two previous methods.
We believe that the weighted average scheme could be extended to predict other functional sites, such as protein-protein and protein-RNA interaction residues.
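The weighted average scheme in item 3 can be illustrated with a minimal sketch. The abstract does not give the form of the weighting factor, so the Gaussian-style decay in `distance_weight`, the `cutoff` and `scale` values, and all function and variable names below are assumptions made only for illustration, not the dissertation's actual definitions. The point of the sketch is the dimensionality: concatenating the properties of every neighbor yields a vector whose length grows with the patch size, whereas the weighted average keeps one value per property.

```python
import math


def distance_weight(distance, scale=5.0):
    """Hypothetical weighting factor: contribution decays with distance.

    The Gaussian-style form and the scale (in angstroms) are assumptions;
    the abstract only states that the weight is distance-dependent.
    """
    return math.exp(-((distance / scale) ** 2))


def weighted_average_features(center, coords, features, cutoff=10.0, scale=5.0):
    """Represent the patch around residue `center` by one averaged vector.

    coords   -- list of (x, y, z) positions, one per residue
    features -- list of per-residue property vectors of equal length
    cutoff   -- assumed patch radius; residues farther away are ignored
    """
    n_feat = len(features[center])
    totals = [0.0] * n_feat
    weight_sum = 0.0
    for j, xyz in enumerate(coords):
        d = math.dist(coords[center], xyz)
        if d > cutoff:
            continue  # residue lies outside the patch
        w = distance_weight(d, scale)
        weight_sum += w
        for k in range(n_feat):
            totals[k] += w * features[j][k]
    # The center residue itself has d = 0 and weight 1, so weight_sum >= 1.
    return [t / weight_sum for t in totals]


# Toy data: three residues, two properties each (values are made up).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
patch = weighted_average_features(0, coords, features)
```

In this toy case the third residue lies beyond the assumed cutoff and is ignored, and the result has two entries regardless of how many neighbors fall inside the patch. The center residue contributes more than its neighbor because its weight is larger.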
Keywords/Search Tags: protein-other molecules interaction, DNA-binding residues, heme binding residues, temperature factor, betweenness centrality, weighting factor, dimensionality reduction, support vector machine