
Analysis And Comparison For DNA And Protein Sequences

Posted on: 2004-05-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P A He
Full Text: PDF
GTID: 1100360095455236
Subject: Computational Mathematics
Abstract/Summary:
DNA (deoxyribonucleic acid), RNA (ribonucleic acid), and proteins are all macromolecules: unbranched polymers built up from smaller units. In DNA, these units are the four nucleotide residues A (adenine), C (cytosine), G (guanine) and T (thymine); in RNA, they are the four nucleotide residues A, C, G and U (uracil). In proteins, the units are the twenty amino acid residues A (alanine), C (cysteine), D (aspartic acid), E (glutamic acid), F (phenylalanine), G (glycine), H (histidine), I (isoleucine), K (lysine), L (leucine), M (methionine), N (asparagine), P (proline), Q (glutamine), R (arginine), S (serine), T (threonine), V (valine), W (tryptophan) and Y (tyrosine). Thus, a DNA (RNA) sequence can be identified with a word over the alphabet M = {A, C, G, T (U)}, and a protein sequence can be taken as a string over an alphabet of twenty letters. To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequences of these basic units. The tools and methods of combinatorics on words and statistics therefore play important roles in the study of linear sequences of biomolecular units. The main contents are as follows.

In Chapter 1, based on the ideas of homomorphism in algebra and coarse-graining in physics, we introduce the concept of the characteristic sequences of a DNA primary sequence according to the classifications of the chemical structure of the four nucleotide residues A, C, G and T. The characteristic sequences of a DNA primary sequence are a group of (0,1) sequences, each of which is a reduced representation of the given DNA primary sequence, and two of which suffice to reconstruct the primary sequence uniquely. By counting all (0,1) triplets of the characteristic sequences, we construct a set of 2 × 2 matrices to represent a DNA primary sequence. Furthermore, the leading eigenvalues of these matrices are computed and taken as invariants of the DNA primary sequence.
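The coarse-graining step can be illustrated with a minimal Python sketch. It assumes the three standard two-class partitions of the nucleotides (purine/pyrimidine, amino/keto, weak/strong H-bond) that the abstract names later on. The 0/1 labeling, the dictionary `CLASSES`, and the function names `characteristic_sequences` and `reconstruct` are hypothetical choices made for illustration and are not taken from the dissertation. The construction of the 2 × 2 matrices is not specified in the abstract and is not attempted here.

```python
# Assumed convention (not from the source): 1 marks membership in the listed class.
CLASSES = {
    "RY": set("AG"),  # purine (A, G) -> 1, pyrimidine (C, T) -> 0
    "MK": set("AC"),  # amino (A, C) -> 1, keto (G, T) -> 0
    "WS": set("AT"),  # weak H-bond (A, T) -> 1, strong H-bond (C, G) -> 0
}


def characteristic_sequences(dna):
    """Return the three (0,1) characteristic sequences of a DNA string."""
    dna = dna.upper()
    return {
        name: [1 if base in ones else 0 for base in dna]
        for name, ones in CLASSES.items()
    }


def reconstruct(ry, mk):
    """Rebuild the primary sequence from the RY and MK sequences alone."""
    # Each (RY, MK) pair of bits identifies exactly one nucleotide.
    table = {(1, 1): "A", (1, 0): "G", (0, 1): "C", (0, 0): "T"}
    return "".join(table[pair] for pair in zip(ry, mk))


if __name__ == "__main__":
    seqs = characteristic_sequences("ATGGTGCAC")
    for name, bits in seqs.items():
        print(name, "".join(map(str, bits)))
    print(reconstruct(seqs["RY"], seqs["MK"]))
```

Each characteristic sequence discards one bit of information per position, which is why a single sequence is only a reduced representation while a pair of them determines the original word.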
Similarity and dissimilarity analyses based on these invariants are given for the exon-1 sequences of the β-globin gene of eight species: human, goat, gallus, opossum, lemur, mouse, rabbit and rat. In addition, through comparison of the characteristic sequences, we try to identify the biological functions of the purine-pyrimidine, amino-keto and weak-strong H-bond classifications, respectively.

In Chapter 2, we present an application of the characteristic sequences of DNA primary sequences to gene recognition in genomes. First, we suggest a numerical description of the characteristic sequences. Based on this description, we propose a new protein-coding gene-finding algorithm specific to the yeast genome, with an accuracy better than 95%. Furthermore, applying the algorithm, we obtain a total number of protein-coding genes in the yeast S. cerevisiae genome consistent with the widely accepted range of 5800-6000.

In Chapter 3, we generalize the concept of the characteristic sequences of DNA primary sequences to protein primary sequences. According to the physicochemical properties of amino acids, we construct characteristic sequences to represent the hydrophobicity and charge properties of a protein sequence, and give a numerical description of these characteristic sequences. By comparing the characteristic sequences, we obtain some information about the hydrophobicity and charge properties of amino acids in three secondary structural classes of proteins: all-α-helix, all-β-strand and αβ proteins.

In the last chapter, we analyse DNA sequences and their three-dimensional graphical representations using algebraic methods. First, we define some operations on DNA curves and obtain some properties of the DNA curve using the group S4 acting on it. We also define two equivalence relations on DNA curves and count the number of equivalence classes of DNA sequences.
In addition, an inequality related to the entropy of equivalent sequences is proved.
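As a rough illustration of a symmetric-group action on sequences, the sketch below lets S4 act by relabeling the four letters and checks that the Shannon entropy of the base composition is unchanged by any relabeling. This is an assumption about the simplest such action. The abstract does not state the operations on DNA curves, the two equivalence relations, or the inequality that is proved, so none of them is reproduced here. The names `relabel`, `composition_entropy` and `orbit` are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log2

ALPHABET = "ACGT"


def relabel(dna, perm):
    """Apply a permutation of the alphabet, given as a 4-letter string, to dna."""
    return dna.translate(str.maketrans(ALPHABET, perm))


def composition_entropy(dna):
    """Shannon entropy (in bits) of the base frequencies of dna."""
    n = len(dna)
    return -sum((c / n) * log2(c / n) for c in Counter(dna).values())


def orbit(dna):
    """All distinct relabelings of dna under the 24 permutations of ACGT."""
    return {relabel(dna, "".join(p)) for p in permutations(ALPHABET)}


if __name__ == "__main__":
    seq = "ATGGTGCAC"
    images = orbit(seq)
    print(len(images), "distinct relabelings")
    # Relabeling only renames the letters, so the multiset of frequencies,
    # and hence the composition entropy, is the same for every image.
    print({round(composition_entropy(s), 12) for s in images})
```

Sequences in the same orbit are indistinguishable by any statistic that depends only on the multiset of letter frequencies, which is one reason entropy is a natural quantity to study on equivalence classes.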
Keywords/Search Tags: bioinformatics, DNA sequences, characteristic sequences, protein, secondary structural classes of protein, genome, gene recognition algorithm, condensed matrix, numerical characterization for characteristic sequences