Research On Algorithm In Feature Extraction Of Protein Classification

Posted on: 2007-07-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z H Zhang
GTID: 1100360215470575
Subject: Applied Mathematics
Abstract/Summary:
With the success of the Human Genome Project, a widening gap has appeared between the sharply increasing number of known protein sequences and the slow accumulation of known protein structures and functions. A trustworthy theoretical and computational approach to predicting protein structures and functions from the vast number of sequences is urgently needed, and this is a core task of bioinformatics in the post-genomic era.

Because protein structures and functions are so diverse, it is difficult to capture their important features with any simple classification scheme. There are many specialized ways of grouping proteins, each of which has been helpful in some fields. As an offshoot of proteomics research, protein classification has attracted growing attention. Any new breakthrough in this research will help further the understanding of protein structure and function. Moreover, it plays an important role in molecular biology, cellular biochemistry, pharmacology, medicine, and related fields.

Feature extraction from protein sequences is a basic problem in protein classification research and a key factor in classification performance. This thesis studies algorithms for this problem, proposes four new feature extraction algorithms for four basic types of protein classification problems, and tests and analyzes them on standard datasets. The main work and contributions of this thesis are as follows:

1. Protein structural class is very important for protein structure prediction. For a protein of unknown structure, knowing the structural class increases the accuracy of secondary structure prediction and reduces the complexity of tertiary structure prediction. Based on the concept of the measure of diversity, the k-substring diversity source is presented. Combined with the increment of diversity algorithm, this new feature extraction approach is applied to protein structural class prediction. For the dataset T359, the overall accuracy of the SS+Diver model in the Jackknife test is 97.49%, about 1.67 to 56.27 percentage points higher than that of other existing models.

2. To understand the structure and function of a protein, an important task is to identify the quaternary structure of a new polypeptide chain, i.e., whether it forms a monomer, a dimer, or some other oligomer. A computational method for properly classifying the quaternary structure of proteins would therefore be significant in interpreting the raw data produced by large-scale genome sequencing projects. Three different composite feature extraction methods are proposed and, combined with the nearest neighbor algorithm, applied to protein quaternary structure prediction. The simulation results show that the performance based on DPC_ACF is higher than that of the other composite methods. For the dataset RG1639, the overall classification accuracy of DPC_ACF in the Jackknife test is 90.2%, about 2.7% to 31.3% higher than that of other existing models. For the dataset CC3174, the overall classification accuracy of DPC_ACF in the Jackknife test is 91.18%, about 12.68% to 22.78% higher than that of the best existing model.

3. Apoptosis proteins play an important role in the growth and homeostasis of organisms. Understanding the functions of these proteins helps clarify the mechanism of programmed cell death, and knowledge of the subcellular location of an apoptosis protein is important for understanding its function. Based on the idea of coarse-grained description and grouping, a new approach to encoding protein sequences, named encoding based on grouped weight (EBGW), is presented. By combining EBGW with the component-coupled algorithm, the nearest neighbor algorithm, and the support vector machine respectively, three classification models (named EBGW+CCA, EBGW+NNA, and EBGW+SVM) are put forward and applied to the subcellular location prediction of apoptosis proteins. Experiments show that, for the same dataset and the same classification algorithm, the EBGW approach extracts features more effectively than amino acid composition and the instability index. The overall classification accuracy, and the sensitivity and Matthews correlation coefficient of each class, obtained with the EBGW+SVM model are all higher than those of existing models.

4. Membrane proteins are very important in a cell and can be discriminated relatively easily from non-membrane proteins. The determination of the functions of new membrane proteins can be expedited significantly if an effective algorithm can be found to predict their types. Based on the concept of a sub-alphabet, the sub-polypeptide composition of a protein sequence is presented. The new algorithm not only captures more cellular information from the protein sequence but also greatly decreases the computational complexity. For the dataset CE2059, the overall classification accuracy of the model with sub-polypeptide composition is 0.1% higher than that of the model with traditional polypeptide composition, while its computation time is only 11.75% of that of the latter. Compared with existing models, the overall classification accuracy increases by about 1.02 to 25.16 percentage points in the Jackknife test.

5. Finally, the relation between the performance of a classification model and the characteristics of the training dataset is briefly discussed.
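
As a rough illustration of two ideas mentioned above, the following minimal Python sketch computes a k-peptide composition over a reduced (sub-) alphabet and evaluates a nearest neighbor classifier with a Jackknife (leave-one-out) test. It is not the thesis's implementation: the four-group alphabet `GROUPS`, the function names, and the use of Euclidean distance are all assumptions made for the example, since the abstract does not specify them.

```python
from itertools import product
from math import dist

# Hypothetical 4-letter sub-alphabet (NOT taken from the thesis):
# each of the 20 amino acids is mapped to one coarse-grained group.
GROUPS = {
    "h": "AVLIMFWPGC",  # hydrophobic (illustrative grouping)
    "p": "STNQY",       # polar
    "a": "DE",          # acidic
    "b": "KRH",         # basic
}
TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence over the sub-alphabet, skipping unknown letters."""
    return "".join(TO_GROUP[aa] for aa in seq.upper() if aa in TO_GROUP)


def kmer_composition(seq, k=2):
    """Frequencies of all k-letter words over the sub-alphabet.

    With 4 groups the vector has 4**k entries instead of 20**k,
    which is where the reduction in computation comes from.
    """
    reduced = reduce_sequence(seq)
    kmers = ["".join(p) for p in product(sorted(GROUPS), repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(reduced) - k + 1):
        counts[reduced[i:i + k]] += 1
    total = max(len(reduced) - k + 1, 1)  # avoid division by zero
    return [counts[m] / total for m in kmers]


def jackknife_accuracy(features, labels):
    """Leave-one-out test: classify each sample by its nearest other sample."""
    correct = 0
    for i, x in enumerate(features):
        nearest = min(
            (j for j in range(len(features)) if j != i),
            key=lambda j: dist(x, features[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(features)
```

Given a list of sequences `seqs` and their class labels `labels` (both supplied by the user), `jackknife_accuracy([kmer_composition(s) for s in seqs], labels)` returns the fraction of sequences whose nearest neighbor in feature space has the same label.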
Keywords/Search Tags: protein structure and function, bioinformatics, protein classification, feature extraction, protein structural class, quaternary structure, apoptosis protein, subcellular location, membrane protein