
Predicting Protein Protein Interactions And Its Active Sites Based On Data Mining Algorithm

Posted on: 2012-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Zhang
Full Text: PDF
GTID: 2210330362959213
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the rapid development of high-throughput sequencing technology, the number of known protein sequences is growing exponentially, yet the functions and interactions of most of these proteins remain unknown. Analyzing protein characteristics and interactions more quickly, and annotating active sites and their functions effectively, has therefore become a pressing issue. At the same time, the rapid development of computer technology has provided a solid foundation for molecular biology research. Because research on protein-protein interactions and active sites involves massive amounts of data, analyzing these data with data mining theory and revealing the natural laws behind them has become a cutting-edge research field in proteomics and computational biology.

In recent decades, many data mining methods for analyzing protein sequences have been proposed and have received sustained attention from researchers. In this dissertation, we develop more effective data mining methods for accurately predicting protein-protein interactions and protein active sites. We also provide a standalone algorithm package and online websites for the proposed algorithms. The work and novelties of this dissertation include:

We propose a novel approach to predicting protein-protein interactions based on a compressive sampling algorithm. First, we extract distinctive features from protein sequences. The original high-dimensional sequence feature vector is then compressed into a much lower-dimensional but more condensed space by taking the sparsity of the original signal into account. We compare compressive sampling with other traditional dimension reduction methods and demonstrate its efficiency. We then construct support vector machine and rotation forest models in the compressed feature domain and verify that these models effectively avoid overfitting. Finally, we discuss the impact of imbalanced datasets and of different strategies for constructing the negative dataset.

We propose a novel approach to predicting protein active sites through bi-profile sampling and the jackknife test. First, we extract sequence conservation features and preprocess them to avoid overfitting. We then use the bi-profile sampling method to re-encode amino acid composition, protein secondary structure, protein disorder information, and the solvent accessibility of amino acids. Finally, we compare model performance in predicting protein active sites across different feature combinations, algorithms, and ensemble strategies, and we study the robustness of these models on imbalanced datasets.
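The compression step in the first contribution can be illustrated with a small sketch. The abstract does not name the sequence features or the measurement matrix used in the thesis, so everything specific here is an assumption made for illustration: dipeptide frequencies stand in for the "distinctive features", a seeded Gaussian random matrix stands in for the compressive sampling operator, and the two sequences are made up. It shows only how a sparse high-dimensional vector is projected into a lower-dimensional space, not the thesis's actual pipeline.

```python
import math
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed feature choice: the abstract does not say which features are extracted.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 entries
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_features(sequence):
    """Return a 400-dimensional dipeptide frequency vector.

    For short sequences most entries are zero, i.e. the vector is sparse.
    """
    counts = [0.0] * len(DIPEPTIDES)
    for a, b in zip(sequence, sequence[1:]):
        idx = INDEX.get(a + b)
        if idx is not None:  # skip pairs containing non-standard residues
            counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def measurement_matrix(n_measurements, n_features, seed=0):
    """Gaussian random matrix with entries drawn from N(0, 1/n_measurements)."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(n_measurements)
    return [[rng.gauss(0.0, sigma) for _ in range(n_features)]
            for _ in range(n_measurements)]


def compress(features, phi):
    """Project a feature vector into the measurement space: y = Phi x."""
    return [sum(w * x for w, x in zip(row, features)) for row in phi]


# 400 original dimensions compressed to 40 measurements (hypothetical sizes).
phi = measurement_matrix(n_measurements=40, n_features=len(DIPEPTIDES))

protein_a = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequences, not real data
protein_b = "GSHMLEDPVDAFKQRLEA"

# One simple way to represent a protein pair: concatenate both compressed vectors.
pair_vector = (compress(dipeptide_features(protein_a), phi)
               + compress(dipeptide_features(protein_b), phi))
print(len(pair_vector))  # 80
```

In a setup like this, the compressed pair vectors would then be passed to a classifier such as the support vector machine or rotation forest models mentioned in the abstract. The same matrix must be reused for every protein so that all samples share one compressed space.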
Keywords/Search Tags: Protein-protein interactions, protein active sites, support vector machine, rotation forest, ensemble learning, feature selection and extraction, compressive sampling, bi-profile sampling