
Research On Technology Of Computational Biology For Protein Structure Prediction

Posted on: 2007-02-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Y He
Full Text: PDF
GTID: 1118360212465573
Subject: Computer application technology
Abstract/Summary:
The exponential growth of biological data has given rise to a new multidisciplinary research area, computational biology, and with it new challenges for the data mining, machine learning and statistical learning communities. One of the major research problems in computational biology is predicting protein structure from protein sequence. From a computer science perspective, this is a classification problem, and how to build effective and efficient classification models is an active research topic in data mining, machine learning and statistical learning.

Focusing on training with massive data sets, the comprehensibility of predictions, and the improvement of accuracy through active learning, this dissertation systematically investigates the models, methodology and key techniques for protein structure prediction in computational biology. Furthermore, by combining the granular-computing prediction model proposed in this dissertation with methods from data mining, machine learning and statistical learning, the work aims to form a comprehensive and systematic framework for biological data classification and prediction, on which novel intelligent prediction systems with high adaptivity, comprehensibility and efficiency can be developed in the future.

The main contributions of the dissertation are as follows:

1. A novel data prediction model, Support Vector Machine with Granular Computation (SVM_GC), is proposed for complex prediction problems with massive biological data. Combining granular computing theory, a clustering algorithm and advanced statistical learning methodology, a separate SVM_GC is built for each information granule produced by the clustering algorithm. This makes the learning task of each SVM_GC more specific and simpler. Moreover, because one SVM_GC is built per granule, the models are well suited to parallel processing, so massive data can be handled by divide and conquer and multi-class classification over massive data can be solved efficiently.

2. To address the interpretability of predictions, a new approach to rule generation for protein structure prediction is presented that integrates the advantages of the Support Vector Machine (SVM) and the decision tree. It combines the generalization ability of SVM and the comprehensibility of the decision tree in a new algorithm called SVM_DT. Experimental results show that SVM_DT is much more interpretable than SVM and generalizes better than a decision tree. Most importantly, the rules reveal significant biological meaning and can thus be used to guide "wet experiments".

3. Because a large number of rules is difficult for researchers to interpret and analyze, a new rule-clustering approach (C_SuperRule) is presented for super-rule generation. Using the K-means clustering algorithm, a large number of rules are clustered by similarity, and the rules in each cluster are then aggregated into new super-rules. These super-rules represent the consensus rule pattern and the essential underlying relationships among classes. The super-rules from each cluster make it easier for researchers to understand the general trend and to ignore the noise that a single rule may introduce. They also allow researchers both to focus interactively on the key aspects of the domain and to selectively review the original detailed rules of the corresponding cluster, which makes the rules more convenient to analyze and use.

4. To reduce the effects of noise and outliers on prediction, a new active learning model is proposed based on the genetic algorithm (GA) and a weighting scheme driven by surprising patterns. Because weights derived from surprising patterns are assigned to the input points, each input point contributes differently to learning. The genetic algorithm is used to search for optimal parameters, and parallel GAs are implemented on a cluster to speed up learning. Through active learning of the SVM, the proposed model makes the SVM more robust to noise and outliers. Experimental results on protein structure prediction demonstrate that the model is effective and promising.
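
The divide-and-conquer idea behind the first contribution (partition the data into granules by clustering, then train one model per granule) can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: a nearest-class-mean classifier stands in for the SVM so that the sketch needs only the Python standard library, k-means uses a deterministic farthest-point initialization, and all names (GranularClassifier, ClassMeanModel, make_toy_data) as well as the toy two-dimensional data with labels "H" and "E" are hypothetical.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def nearest(point, centers):
    """Index of the center closest to point (first index wins on ties)."""
    return min(range(len(centers)), key=lambda i: dist2(point, centers[i]))


def kmeans(points, k, iters=20):
    """Lloyd's k-means with farthest-point initialization; returns k centers."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


class ClassMeanModel:
    """Assumed stand-in for an SVM: predicts the label of the nearest class mean."""

    def fit(self, X, y):
        by_label = {}
        for x, label in zip(X, y):
            by_label.setdefault(label, []).append(x)
        self.labels = sorted(by_label)
        self.means = [mean(by_label[label]) for label in self.labels]
        return self

    def predict(self, X):
        return [self.labels[nearest(x, self.means)] for x in X]


class GranularClassifier:
    """Clusters the inputs into granules and trains one local model per granule."""

    def __init__(self, n_granules, make_model=ClassMeanModel):
        self.n_granules = n_granules
        self.make_model = make_model

    def fit(self, X, y):
        self.centers = kmeans(X, self.n_granules)
        buckets = {}
        for x, label in zip(X, y):
            xs, ys = buckets.setdefault(nearest(x, self.centers), ([], []))
            xs.append(x)
            ys.append(label)
        # Each granule's model is trained independently of the others.
        self.models = {g: self.make_model().fit(xs, ys) for g, (xs, ys) in buckets.items()}
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # Route the point to the nearest granule that has a trained model.
            g = min(self.models, key=lambda i: dist2(x, self.centers[i]))
            preds.append(self.models[g].predict([x])[0])
        return preds


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predicted, actual)) / len(actual)


def make_toy_data():
    """Four 3x3 grids of points; the label pattern flips between x~0 and x~20."""
    blobs = [((0, 0), "H"), ((0, 4), "E"), ((20, 0), "E"), ((20, 4), "H")]
    X, y = [], []
    for (cx, cy), label in blobs:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                X.append((cx + dx, cy + dy))
                y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    single = ClassMeanModel().fit(X, y)
    granular = GranularClassifier(n_granules=2).fit(X, y)
    print("single model accuracy:", accuracy(single.predict(X), y))
    print("granular accuracy:", accuracy(granular.predict(X), y))
```

On this toy data the two classes have identical global means, so a single ClassMeanModel cannot separate them, whereas routing each point to its granule's local model does. Because the per-granule models are trained independently of one another, the training loop in fit could be distributed across workers, which is the property the abstract highlights for parallel processing.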
Keywords/Search Tags: Data Mining, Machine Learning, Statistical Learning, Granular Computing, Support Vector Machine, Computational Biology, Protein Structure Prediction