Application Of Fourier, Wavelet And Recurrence Quantification Analyses To Classification Of Coding And Non-coding Sequences And Protein Structure Classes

Posted on: 2008-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhou
Full Text: PDF
GTID: 2120360218457911
Subject: Applied Mathematics
Abstract/Summary:
This thesis comprises two main parts, each of which studies one basic problem in bioinformatics. The first problem is to distinguish coding sequences from noncoding sequences; the second is to classify proteins into the four protein structure classes: all-α, all-β, α+β and α/β proteins. In both parts, a linear classifier is used to give the discriminant accuracies of the methods proposed in this thesis.

For the problem of distinguishing coding and noncoding sequences, a Fourier transform method is proposed for the sequences in a complete genome. It is based on a number sequence representation of the nucleotide sequence proposed by our group [65] and on the imperfect periodicity of 3 in protein-coding sequences [15]. Three exponents, Px((?))(1), Px((?))(1/3) and Px((?))(1/36), in the Fourier transform of the number sequence representation of coding or noncoding sequences are selected to form a parameter space, so that each coding or noncoding sequence is represented by a point in a three-dimensional space. For the complete genomes of many prokaryotes, the points corresponding to coding and noncoding sequences fall roughly into different regions. If the point (Px((?))(1), Px((?))(1/3), Px((?))(1/36)) for a nucleotide sequence lies in the region corresponding to coding sequences, the sequence is classified as coding; otherwise, it is classified as noncoding. Fisher's discriminant algorithm is used to give the discriminant accuracies. The average discriminant accuracies pc, pnc, qc and qnc over all 51 prokaryotes obtained by the present method reach 81.43%, 92.05%, 81.07% and 91.87%, respectively [67].

For the problem of classifying the four protein structure classes, we attempt two different approaches. First, we use recurrence quantification analysis (RQA) to study the three-dimensional coordinates of the alpha-carbon atoms of proteins in order to classify protein structures. This yields three parameters, %determ1, %determ21 and %determ22, which are used to construct a three-dimensional parameter space. To give a quantitative assessment of our clustering of the selected proteins, Fisher's discriminant algorithm is used to distinguish each of the four structure classes from the others, one by one. Numerical results indicate that the discriminant accuracies are high [66].

Second, we use local Hölder exponents to capture local patterns in protein sequences. The number sequence representation of a protein based on a 6-letter model of amino acids [9] is treated as a time series, and its local Hölder exponents are estimated. The probability density distribution of the local Hölder exponents is then calculated. The probability density values are taken as features for a perceptron, constructed with the Neural Network Toolbox in MATLAB, that classifies proteins into the all-α, all-β, α+β and α/β structure classes. All 49 selected large proteins are classified correctly (100%) [68].
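As a rough illustration of the period-3 signal that the Fourier part of the abstract relies on, the following minimal sketch computes the normalized spectral power of a numerically encoded DNA string at a chosen frequency. The mapping `ASSUMED_MAP` and the function `power_at` are hypothetical names introduced here for illustration only: the thesis uses its own number sequence representation [65] and its own definition of the three exponents, neither of which is given in this abstract.

```python
import cmath

# Hypothetical nucleotide-to-number mapping, for illustration only.
# The thesis uses its own representation (ref. [65]), not shown in the abstract.
ASSUMED_MAP = {"A": -2, "C": -1, "G": 1, "T": 2}


def power_at(seq, freq):
    """Return |X(freq)|^2 / N for the mean-centered numeric sequence.

    `freq` is in cycles per base, so freq = 1/3 probes period-3 structure.
    """
    values = [ASSUMED_MAP[base] for base in seq.upper()]
    n = len(values)
    mean = sum(values) / n
    # Discrete Fourier sum evaluated at a single, arbitrary frequency.
    total = sum(
        (v - mean) * cmath.exp(-2j * cmath.pi * freq * k)
        for k, v in enumerate(values)
    )
    return abs(total) ** 2 / n


if __name__ == "__main__":
    # A toy sequence with exact period 3, standing in for a coding region.
    periodic = "ATG" * 40
    print("power at 1/3 :", power_at(periodic, 1 / 3))
    print("power at 1/36:", power_at(periodic, 1 / 36))
```

On the toy period-3 string, the power at frequency 1/3 is far larger than at 1/36. Real coding sequences show only an imperfect version of this peak, which is presumably why the method combines several spectral quantities with a discriminant rather than thresholding one value.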
Keywords/Search Tags: coding/noncoding sequences, Genome, Fourier transform, protein structure classes, RQA, wavelet transform, local Hölder exponent