A Novel Method For Prediction Of Protein Domain Using Distance-based Maximal Entropy

Posted on: 2009-07-27
Degree: Master
Type: Thesis
Country: China
Candidate: G X Gong
Full Text: PDF
GTID: 2120360242980236
Subject: Computer application technology
Abstract/Summary:
Bioinformatics is a new discipline that developed to meet this challenge. It is an interdisciplinary subject formed at the intersection of biology, applied mathematics, and computer science, a major frontier of the life sciences and natural sciences, and one of the core scientific fields of the 21st century. Proteins carry out physiological functions and are the direct embodiment of life activity, so the study of protein structure and function directly reveals the mechanisms of change under physiological or pathological conditions. Questions about the forms in which proteins exist and act, such as post-translational modification, protein-protein interaction, and protein conformation, can still only be resolved by studying proteins directly. The protein domain is one level of protein structure and is regarded as the basic unit of protein structure, folding, function, and evolution. Detecting protein domains is a challenging problem. Many approaches exist, such as expert systems that infer domains from known protein structures and methods that predict domain boundaries from protein structural information. However, structural information is available for only a small fraction of proteins. As the number of sequences of unknown structure grows rapidly, correctly defining domains and predicting domain boundaries from the amino acid sequence alone becomes very important. Some methods that use only protein sequence information have been proposed; they delineate domain boundaries by similarity search and multiple sequence alignment. Current research methods include analysis of protein structural information, similarity search and multiple sequence alignment, hidden Markov models, expert knowledge, and several other approaches.
This thesis takes its domain boundary definitions from the SCOP database. After analyzing a large amount of data from the database and taking the experimental conditions into account, representative single-chain proteins, selected with regard to homology, were chosen as the research objects. Building on an analysis of existing methods, the approach makes comprehensive use of the evolutionary information in protein sequences together with their static features, and applies a support vector machine (SVM) learning system to the preprocessed data, with the goal of predicting domain boundary signals accurately and quickly.
The support vector machine is a concrete method from statistical learning theory. To a large extent it resolves several fundamental problems in pattern recognition, such as model selection and overfitting, nonlinearity and the curse of dimensionality, and local minima. As a result, statistical learning theory and SVMs have become a research hot spot in pattern recognition and machine learning in recent years. The difficulties that currently limit SVM applications include slow training and the lack of theoretical guidance for choosing system parameters. The basic idea of the SVM can be summarized as follows: first, a nonlinear transformation maps the input space into a high-dimensional space; then the optimal linear separating surface is sought in that space. The nonlinear transformation is realized by defining an appropriate inner product (kernel) function.
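The role of the inner product function can be illustrated with a small sketch. It is not taken from the thesis: the function names, the degree-2 polynomial kernel, and the sample points are assumptions chosen for illustration. The sketch shows that a kernel evaluated in the input space equals an ordinary inner product in a higher-dimensional feature space, and it also defines the RBF kernel with the width parameter written as `sigma2`.

```python
import math

def rbf_kernel(x, y, sigma2):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma2))

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def poly2_feature_map(x):
    """Explicit 3-D feature map matching poly2_kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Illustrative points (assumed, not from the thesis).
x, y = (1.0, 2.0), (3.0, -1.0)

# Kernel evaluated directly in the 2-D input space.
implicit = poly2_kernel(x, y)
# Same quantity as an inner product in the 3-D feature space.
explicit = dot(poly2_feature_map(x), poly2_feature_map(y))

print(implicit, explicit)
print(rbf_kernel(x, y, sigma2=1.0))
```

The two printed values agree up to rounding, which is why an SVM never needs to construct the high-dimensional vectors explicitly. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the kernel form is usable in practice.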
Here statistical learning theory takes an approach entirely different from traditional methods. Rather than trying to reduce the dimension of the input space, it raises the dimension so that the original problem becomes linearly separable (or nearly so) in the high-dimensional space. Because raising the dimension changes only the inner product operation, the complexity of the algorithm does not grow with the number of dimensions, and the generalization ability in the high-dimensional space is not affected by the dimension.
The SVM is trained with a radial basis function (RBF) kernel. Using k-fold cross-validation and a grid search to identify the best C and σ², this approach predicts protein domains with an accuracy of about 80%.
Classification of imbalanced data sets is a new research topic in pattern recognition and machine learning and a major challenge for traditional classifiers; solving it may stimulate new ideas in classification research and so improve machine learning systems. An imbalanced data set is one in which some classes have far more samples than others. The class with many samples is called the majority class and the class with few samples the minority class, and the minority class is often the one of interest. Imbalanced data sets arise in many practical applications. When classifying such data, however, traditional classifiers tend to achieve a high recognition rate on the majority class and a very low one on the minority class.
To improve classification performance on the minority class, solutions to the imbalanced classification problem are divided into data-level methods and algorithm-level methods. Data-level methods preprocess the training set and then train the classifier on the processed data.
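The bias toward the majority class described above can be made concrete with a minimal sketch. The class sizes and the labels (0 for the majority class, 1 for the minority class) are hypothetical and are not data from the thesis. The sketch compares overall accuracy with the per-class recognition rate for a classifier that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of all samples that are predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Map each label to the fraction of its samples predicted correctly."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        if t == p:
            hits[t] = hits.get(t, 0) + 1
    return {label: hits.get(label, 0) / n for label, n in totals.items()}

# Hypothetical imbalanced data: 95 majority samples (0), 5 minority samples (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
print(acc)     # overall accuracy looks high
print(recall)  # minority-class recognition rate is zero
```

Overall accuracy alone hides the failure on the minority class, which is why per-class recognition rates are the more informative measure for imbalanced problems.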
Data-level methods are also known as resampling methods and are divided into oversampling and undersampling. Oversampling improves minority-class performance by increasing the number of minority samples; the simplest way is to duplicate minority samples. Its drawback is that it adds no new information to the minority class and shrinks the decision region learned by the classifier, which leads to overfitting. Undersampling improves minority-class performance by reducing the number of majority samples; the simplest form, random undersampling, randomly removes some majority samples to reduce the size of the majority class. This thesis uses undersampling: oversampling adds samples, and the larger sample count degrades classification, whereas undersampling reduces the sample size.
Because SVMs perform poorly on imbalanced data sets, this thesis proposes a new undersampling method that works in the SVM feature space and is based on distance and maximal entropy. With this preprocessing, SVM learning becomes an effective way to handle imbalanced data sets, because an SVM builds its model only from the data close to the boundary (the support vectors). It is worth noting that, since the SVM is a kernel-based method, classification is defined in the feature space, so undersampling can be used to preprocess the samples.
Using the same training and test sets, experiments were carried out with the traditional SVM and with the SVM after undersampling preprocessing, and the corresponding parameters and experimental data are reported.
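The abstract does not spell out the distance-based maximal entropy criterion, so the following is only a sketch of the surrounding idea under stated assumptions. It shows how a distance in the kernel feature space can be computed from kernel evaluations alone, and it uses a simple stand-in rule (keep the majority samples nearest to the minority class) in place of the thesis's criterion. The function names, the stand-in rule, and the sample points are all hypothetical.

```python
import math

def rbf_kernel(x, y, sigma2=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma2))

def feature_space_distance(x, y, kernel):
    """Distance between phi(x) and phi(y) using only kernel values:
    ||phi(x) - phi(y)||^2 = K(x, x) - 2 K(x, y) + K(y, y)."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors

def undersample_majority(majority, minority, keep, kernel):
    """Stand-in rule, NOT the thesis's entropy criterion: keep the `keep`
    majority samples whose nearest minority sample is closest in feature space."""
    def nearest_minority(x):
        return min(feature_space_distance(x, m, kernel) for m in minority)
    return sorted(majority, key=nearest_minority)[:keep]

# Hypothetical 2-D samples, for illustration only.
minority = [(0.0, 0.0), (0.5, 0.0)]
majority = [(0.2, 0.1), (3.0, 3.0), (1.0, 0.0),
            (5.0, 5.0), (0.0, 0.6), (4.0, -4.0)]

kept = undersample_majority(majority, minority, keep=3, kernel=rbf_kernel)
print(kept)
```

Keeping majority samples near the minority class is motivated by the observation in the text that an SVM builds its model from data close to the boundary; a different selection rule can be substituted by changing the sort key.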
The experimental results show that undersampling preprocessing not only reduces the size of the input data and the training time but also improves the classification accuracy of the SVM, so that it outperforms the traditional SVM on imbalanced data. This result provides a new method for domain boundary detection and a new clue for identifying the structure of biological macromolecules in solution.
Keywords/Search Tags:Distance-based