Analysis and Prediction of RNA-Binding Residues in Protein Molecules

Posted on: 2013-01-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Cha
Full Text: PDF
GTID: 1110330374460957
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
Protein-RNA interaction plays an important role in many biological processes, such as RNA splicing, translation, protein synthesis and post-transcriptional regulation. Identifying the RNA-binding residues in proteins therefore provides valuable information for understanding the mechanisms of protein-RNA interaction.

Current approaches to studying protein-RNA interaction can be divided into experimental methods and bioinformatics methods. Experimental methods, such as X-ray crystallography or nuclear magnetic resonance, are used to determine the structure of a protein-RNA complex, from which the RNA-binding residues can be identified. The advantage of experimental methods is that the results are reliable. However, obtaining the structure of a protein-RNA complex is time-consuming, and for some complexes it is difficult to obtain a crystal structure at all.

With the growing amount of structural data on protein-RNA interaction, researchers have been trying to find RNA-binding residues through bioinformatics methods, which fall mainly into three categories: structural domain methods, molecular dynamics simulations and machine learning methods. The core idea of the structural domain methods is to find RNA-binding residues by locating RNA-binding domains using protein structure databases such as SCOP. However, this approach can only be used for proteins whose RNA-binding domains have already been determined. In addition, the mechanism of RNA-binding domains is not yet well understood: residues in an RNA-binding domain sometimes interact with regions of the RNA other than the target region, or even with other proteins. Another way of finding RNA-binding residues is molecular dynamics simulation, which makes it possible to observe the whole binding process and to determine the changes in energy and conformation during that process. The first drawback of simulation methods is that they are time-consuming and feasible only for small systems.
The second is that the correctness of a simulation depends on its parameter settings, and it is sometimes very difficult to find the optimal parameters. With the accumulation of a large amount of structural data, however, it has become possible to find RNA-binding residues by machine learning methods, and several models have recently been proposed to predict RNA-binding residues. Through a thorough analysis of those models, we found the following shortcomings. Firstly, the number of training samples is small, which may lead to biased results. Secondly, the number of features is small; some studies considered only a few features and may therefore have missed important variables. Thirdly, some models are built on features extracted from 3D structure data, or on complex physicochemical features, and so cannot be applied to protein sequences without 3D structures.

To solve these problems, we need a prediction model with the following characteristics: 1. the model should be developed on a large dataset to avoid bias; 2. more features should be extracted in order to improve prediction performance; and 3. the features selected to build the prediction model should be derived from sequence information only. To this end, we developed our models as follows.

Firstly, we extracted 532 protein-RNA complex samples from the PDB database released before June 2011. These complexes were determined by X-ray crystallography with a resolution better than 3 Å and contain only protein and RNA sequences. After removing 90 samples that had an RNA chain shorter than 4 nucleotides or errors in the sequence data, we obtained a dataset of 429 samples containing 1970 protein sequences and 823 RNA sequences. To reduce data redundancy, the protein sequences were clustered into 429 groups by BLASTClust with sequence identity above 25%. The first sequence of each group was selected as the representative of that group.
After that, we obtained 429 non-redundant protein sequences containing 90735 amino acid residues.

Binding sites are defined by the distance between atoms: if any atom of an amino acid residue falls within a cutoff distance of 3.5 Å of any atom of the RNA molecule in the complex, the residue is designated a binding site. Among the 90735 amino acid residues in the dataset, we found 10525 binding residues and 80210 non-binding residues.

After the binding sites were defined, each amino acid residue was characterized by nine classes of features: ① the number of atoms; ② the number of electrostatic charges; ③ the number of potential hydrogen bonds; ④ side chain pKa value; ⑤ hydrophobicity index; ⑥ relative accessible surface area; ⑦ secondary structure; ⑧ smoothed PSSM; ⑨ classification of amino acids based on dipole moment and side chain volume.

Finally, we applied the TClass program to select features and construct the prediction model by combining the Naïve Bayes classification method with a forward feature selection strategy. Furthermore, the attribute bagging method was used to improve classifier performance. A test on an independent dataset shows that the classifier achieves 83.86% overall accuracy, with 83.32% sensitivity and 80.55% specificity. A case study of the Xlrbpa protein shows a good overlap between the positions predicted by our model and those determined from the RNA-binding domain.

By analyzing the relationship between the propensities of amino acid usage and the features, we obtained the following results: ① RNA shows a strong bias in amino acid selection; the most frequent amino acid occurs 38 times as often as the least frequent one. ② Hydrophilic amino acids are more frequent than hydrophobic amino acids, occurring 4.38 times as often. ③ Positively charged polar amino acids are more frequent than non-polar amino acids. ④ Amino acid residues whose dipole moment is greater than 3.0 debye and whose side chain volume is greater than 50 Å³ are favored by nucleotides.
Amino acids whose dipole moment is greater than 3.0 debye but with the opposite orientation are disfavored by nucleotides.

Based on the prediction model we developed, we built an online prediction server called RBRPre, powered by MATLAB Builder JA, MySQL and JSP. Users can visit the website and submit a protein sequence; the prediction result is then sent to the user by email. To avoid crashes caused by highly concurrent access, we implemented a queue scheduling algorithm using MySQL and crontab.

In summary, based on a large dataset of protein-RNA complexes and a large number of features, we developed an RNA-binding residue prediction model and analyzed the relationship between the propensities of amino acid usage and the features. The test results on the independent dataset and the case study of the Xlrbpa protein show that the prediction model achieves good performance.

This work leads to the following results: ① It makes it possible to identify RNA-binding residues from sequence information alone. ② It provides valuable information for understanding the mechanism of protein-RNA interaction through the analysis of the relationship between the propensities of amino acid usage and the features. ③ Through the construction of the online prediction server RBRPre, it provides better bioinformatics support for searching for RNA-binding sites in proteins and speeds up the progress of related experiments.

The innovations of this work are: ① With this model, researchers can identify RNA-binding residues in proteins based on sequence information alone, and the online prediction tool RBRPre provides an easy-to-use service for researchers in the field. ② Based on a large dataset and a large number of features, the model yields reliable results without bias. ③ The bias of amino acid selection at RNA-binding sites is analyzed, as is the relationship between amino acid features and this RNA-binding bias.
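
The distance-based definition of a binding site given in the abstract can be illustrated with a short sketch. This is not the dissertation's code: the function name `label_binding_residues`, the dictionary layout and the toy coordinates are assumptions made purely for illustration, and only the 3.5 Å cutoff comes from the abstract.

```python
import math

# Cutoff stated in the abstract (in angstroms).
CUTOFF_ANGSTROM = 3.5


def label_binding_residues(residue_atoms, rna_atoms, cutoff=CUTOFF_ANGSTROM):
    """Label each residue as binding (True) or non-binding (False).

    residue_atoms: dict mapping a residue id to a list of (x, y, z) atom
                   coordinates (hypothetical layout, for illustration only).
    rna_atoms:     list of (x, y, z) coordinates of all RNA atoms.
    A residue is a binding site if any of its atoms lies within `cutoff`
    of any RNA atom.
    """
    labels = {}
    for residue_id, atoms in residue_atoms.items():
        labels[residue_id] = any(
            math.dist(atom, rna_atom) <= cutoff
            for atom in atoms
            for rna_atom in rna_atoms
        )
    return labels


# Toy coordinates (hypothetical, not taken from any PDB entry).
residue_atoms = {
    "A:10": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],  # closest atom is 2.5 from the RNA atom
    "A:11": [(10.0, 0.0, 0.0)],                  # 6.0 from the RNA atom
}
rna_atoms = [(4.0, 0.0, 0.0)]

labels = label_binding_residues(residue_atoms, rna_atoms)
print(labels)
```

The all-pairs loop is adequate for a sketch; for whole complexes a spatial index would normally be used to avoid comparing every atom pair.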
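
For readers less familiar with the metrics reported for the classifier, the sketch below shows how overall accuracy, sensitivity and specificity are conventionally computed from a confusion matrix. The helper name `classification_metrics` and the counts are hypothetical toy values chosen for illustration; they are not the dissertation's test data.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp: binding residues predicted as binding
    fn: binding residues predicted as non-binding
    tn: non-binding residues predicted as non-binding
    fp: non-binding residues predicted as binding
    """
    sensitivity = tp / (tp + fn)               # fraction of binding residues recovered
    specificity = tn / (tn + fp)               # fraction of non-binding residues recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # fraction of all residues labeled correctly
    return accuracy, sensitivity, specificity


# Hypothetical toy counts, not results from the dissertation.
accuracy, sensitivity, specificity = classification_metrics(tp=8, fn=2, tn=72, fp=18)
print(f"accuracy={accuracy:.2%} sensitivity={sensitivity:.2%} specificity={specificity:.2%}")
```

Because binding residues are a minority class in the dataset described above (10525 of 90735 residues), sensitivity and specificity are more informative together than accuracy alone.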
Keywords/Search Tags: Protein-RNA interaction, binding site, prediction model, bias