
The Study Of Nonlinear Methods For The Prediction Of Protein Structural Classes And Functions And Phylogenetic Analysis

Posted on: 2014-06-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G S Han
Full Text: PDF
GTID: 1260330401489851
Subject: Applied Mathematics
Abstract/Summary:
With the development of biotechnology and bioinformatics, the volume of biological data grows exponentially every year. It is not practical to analyze such massive data solely through expensive and time-consuming biochemical experiments. To meet this need, reliable and effective computational methods and algorithms are urgently required. This thesis studies the prediction of protein structural classes and functions, and phylogenetic analysis, based on methods from nonlinear science. The work is summarized as follows.

In Chapter 2, we study the prediction of the structural classes of low-homology proteins. Based on predicted secondary structures, we propose a new and simple kernel method to predict protein structural classes. The secondary structures of all amino acid sequences are obtained with the tool PSIPRED. A linear kernel based on secondary structure element alignment scores is then constructed and used as a precomputed kernel function for training a support vector machine classifier, with no parameter tuning. The overall accuracies on two public low-homology datasets are higher than those obtained by other existing methods. In particular, our method achieves higher accuracies than other methods in differentiating the α+β class from the α/β class. We conclude that the linear kernel based on secondary structure element alignment scores captures the similarity between two secondary structural element sequences better than existing statistical information extracted from predicted secondary structures.

In Chapter 3, we study the subcellular localization of proteins. The function of a protein is closely related to its subcellular location. Amino acid composition is one of the important models for the subcellular localization of proteins, but it ignores sequence-order information.
To make up for this deficiency, we add two methods, recurrence quantification analysis and the Hilbert-Huang transform, which can extract recurrence patterns and frequency information from time series. To apply these two methods, we convert each amino acid sequence into two time series using the hydrophobic free energies and solvent accessibilities of the 20 amino acids. The ensemble model of amino acid composition, recurrence quantification analysis and the Hilbert-Huang transform generates 62 features, so each amino acid sequence is represented by a 62-dimensional feature vector. All features are ranked by the maximum relevance and minimum redundancy method, and a support vector machine is again used as the classifier. The jackknife test is used to select the optimal feature subset and to evaluate and compare our method with other existing methods. Our method is tested on three apoptosis protein datasets. The final results show that our method achieves the best performance while using relatively few features. This suggests that our method may complement the existing methods.

In Chapter 4, we study the subnuclear localization of proteins. Compared with subcellular localization, subnuclear localization is more challenging. A novel two-stage multiclass support vector machine is proposed and successfully applied to predict the subnuclear localization of proteins. It considers only feature extraction methods based on amino acid classifications and physicochemical properties. To reduce computational complexity and the number of features, we propose a two-step feature selection process to find the optimal feature subset. In our system, all classifiers are constructed using support vector machines with probability output. We use the radial basis kernel function, whose parameter is determined by an automatic optimization method to speed up our system. A weighting strategy is used to handle the unbalanced dataset.
The results on three datasets show that our ensemble method is effective for predicting protein subnuclear locations compared with existing methods for the same problem, and that it outperforms popular machine learning classifiers (such as the support vector machine and random forest).

In Chapter 5, we study vertebrate phylogeny based on mitochondrial genomes. The mitochondrial genomes are represented by the chaos game representation (CGR), a tool for DNA sequence representation. Two Markov chain models are then used to simulate the CGRs of mitochondrial genomes and are considered as candidate models of the noise background. Alignment-free methods are constructed based on the two Markov chain models and applied to analyze the phylogeny of 64 selected vertebrates. Finally, we conclude from the results that the second-order Markov chain model is more powerful than the first-order Markov chain model in simulating the CGRs of the mitochondrial genomes, while the CGRs simulated by the first-order Markov chain model are more suitable for modeling the random background and can be subtracted from the original CGRs to enhance the phylogenetic information in the mitochondrial genomes.
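
The chaos game representation used in Chapter 5 can be illustrated with a minimal Python sketch. It is not taken from the thesis: the corner assignment (A, C, G, T at the corners of the unit square), the starting point (0.5, 0.5), and the function names `cgr_points` and `cgr_grid` are assumptions chosen for illustration. Each nucleotide moves the current point halfway toward its corner, and counting the points that fall in each cell of a 2^k by 2^k grid gives a k-mer frequency picture of the sequence. The Markov chain background models described above are not implemented here.

```python
# Corner assignment is an assumption; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR point reached after each valid nucleotide of seq."""
    x, y = 0.5, 0.5  # assumed starting point: centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cgr_grid(seq, k):
    """Count CGR points in a 2^k x 2^k grid (rows index y, columns index x).

    At this resolution the cell of a point is fixed by the last k bases read,
    so the first k - 1 points, which still depend on the start, are skipped.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i, (x, y) in enumerate(cgr_points(seq)):
        if i + 1 < k:
            continue
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in cgr_grid("ACGTACGTTGCA", 2):
        print(row)
```

Two genomes could then be compared by a distance between their normalized grids; which distance and which background correction to use are choices the abstract does not specify.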
Keywords/Search Tags: Low-homology protein, protein structural classes, secondary structure element alignment, subcellular localization, recurrence quantification analysis, Hilbert-Huang transform, maximum relevance and minimum redundancy, subnuclear localization