
Prediction Of Protein Structure And Function With Machine Learning Methods

Posted on: 2007-09-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 1100360242961684
Subject: Systems analysis and integration
Abstract/Summary:
The ever-expanding body of biological experimental data makes turning data into knowledge a challenging and interesting problem, and the combination of biology and information science has given rise to bioinformatics. Protein sequence information accumulates far faster than protein structure information, so there is strong interest in predicting protein structure from sequence. Knowing the structure alone is not enough, however: the ultimate aim of protein structure prediction is to understand protein function. Structure prediction and function prediction are therefore two central problems in bioinformatics. We have used machine learning methods to study several problems in protein structure and function prediction. The contributions are as follows.

A novel method based on the support vector machine (SVM) has been developed to predict hydrogen-bonded α-turns. The contributions of multiple sequence alignments generated by PSI-BLAST and of predicted secondary structure to the prediction performance are discussed. The results show that the present method performs better than AlphaPred, the current best hydrogen-bonded α-turn prediction method. An online web server named AlphaTurn, based on this method, has been developed. In addition, three schemes for handling a highly unbalanced data set are compared.

We predict, for the first time, α-turns defined as pentapeptides according to the distance between the Cα atoms of residue i and residue i+4. The proposed method performs well. Both multiple sequence alignment information and predicted secondary structure information contribute to the prediction performance. With multiple sequence alignments and predicted secondary structure, the final SVM model yields a Matthews correlation coefficient (MCC) of 0.451 in seven-fold cross-validation.

A robust method designed for the prediction of π-turns has been developed for the first time.
With multiple sequence alignments and predicted secondary structure, the final SVM model yields an MCC of 0.556 in seven-fold cross-validation. We also observed that multiple sequence alignments contribute more to π-turn prediction than to β-turn prediction. As a result, the accuracy achieved for π-turns is higher than that for β-turns, the best-predicted class of tight turns, even though the data set used in this work is more unbalanced than those used for β-turn prediction. The newly computed positional potentials can be applied to the modeling and design of π-turns in proteins.

We developed a robust method for predicting protein residues that interact with RNA, using an SVM and position-specific scoring matrices (PSSMs). Two settings are considered for predicting residues at RNA-binding surfaces: in one, the sequence information of a protein chain known to interact with RNA is given; in the other, the structural information is given. Coupled with PSI-BLAST profiles and predicted secondary structure, the present approach yields an MCC of 0.432 in seven-fold cross-validation, the best among all previously reported RNA-binding site prediction methods. Multiple sequence alignment information contributes substantially to the prediction performance, and further improvement is obtained when structural information is given.

We applied a new measure of information discrepancy to distinguish β-barrel membrane proteins from globular proteins and other membrane proteins. With a subsequence length of 2, our approach correctly recognizes β-barrel membrane proteins with 91% accuracy in 10-fold cross-validation tests. The accuracy for identifying globular proteins is up to 86%, and the approach correctly excludes α-helical membrane proteins with 89% accuracy.
Another interesting finding is that the method retains high prediction accuracy when reduced amino acid alphabets of 15, 12, and 10 letters are used. This suggests that the minimum number of letters needed to discriminate between β-barrel membrane proteins (βMPs) and globular proteins (GPs) is around 10. When evaluated on the same data sets, our method outperforms earlier methods in both overall prediction accuracy and MCC.
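
The results above are reported as Matthews correlation coefficients because the data sets are highly unbalanced, and plain accuracy can look good on such data even when few positives are found. The following is a minimal Python sketch of the standard MCC formula computed from confusion-matrix counts. The function names and the example counts are hypothetical and chosen only for illustration; they are not taken from the dissertation.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when any row or column total is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all residues that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for an unbalanced turn / non-turn data set
# (illustrative only; these are NOT results from the dissertation).
counts = dict(tp=30, tn=900, fp=20, fn=50)

print(f"accuracy = {accuracy(**counts):.3f}")
print(f"MCC      = {matthews_corrcoef(**counts):.3f}")
```

With counts like these, accuracy is dominated by the large negative class, while the MCC also reflects the missed positives. That is why MCC is the more informative figure for comparing turn and binding-site predictors.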
Keywords/Search Tags: α-turns, π-turns, protein-RNA interaction, β-barrel membrane protein, machine learning, support vector machine, information discrepancy