Analysis Of Coding Features Of DNA Sequences Based On Error-Correction Coding Theory

Posted on: 2011-01-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Liu
GTID: 1100360308457756
Subject: Circuits and Systems
Abstract/Summary:
Modern biological research is interdisciplinary rather than confined to a single field. The complexity of biological systems calls for combining theories and methods from several disciplines. The rapidly growing volume of data from genetic engineering has aroused scholars' interest in studying biological systems as information transmission systems. Based on the similarity between biological systems and modern communication engineering in how information is transmitted and coded, the error-correction coding theory of communication engineering has been applied to the study of genetic sequences and the design of biological test systems, with notable progress.

In our research, we studied information analysis methods for biological systems based on the error-correction coding theory of communication engineering, and analyzed the sequences of a number of organisms. This helps explore a new approach to applying communication coding theory in the biological field. The related work is as follows:

1. A codon, rather than a nucleotide, is treated as the basic unit of genetic information, because of the importance of codons in the expression of genetic information. Considering the interaction between adjacent codons, we designed a (6,3) block code model for analysis, using the design method of block code encoding models in communication coding theory as a reference. DNA sequences of twelve prokaryotic organisms and nine eukaryotic organisms with different GC content were selected for analysis with the (6,3) block code model. Code distance was used as the characteristic parameter for detecting the corresponding biological features. We observe that the average code distance fluctuates markedly near the initiation codon and the termination codon. Remarkable changes also appear in the Shine-Dalgarno (SD) region of prokaryotic organisms.

2. In coding systems, convolutional code models generally perform better than block code models, which inspired us to search for a better convolutional code model for the analysis of DNA sequences. Considering the convolutional code encoding model and the results of our block code model, we designed a (6,3,1) convolutional code-based model according to the degeneracy of codons, the context of codons, the short-range dominance of base correlation, and the codon being an information unit. We then analyzed the selected DNA sequences of the twelve prokaryotic organisms and nine eukaryotic organisms with the (6,3,1) convolutional code model. Again, the average code distance fluctuates markedly near the initiation codon and the termination codon, and remarkable changes appear in the SD region of prokaryotic organisms. We also observe a clear period-3 feature in the coding regions of all organisms. We defined a new parameter, the characteristic average code distance (CACD), to describe the separation of the average code distance curves of organisms with different GC content (especially prokaryotic organisms). CACDs are related to GC content and, for prokaryotic organisms, approximately proportional to it. The code parameter therefore carries certain biological information, which shows that this model deserves further study and use in bioinformation processing. We established these models on the basis of general features of genetic information, so they are species-independent and suit the analysis of various kinds of organisms without adjustment of the model.

3. Focusing on the convolutional code model, we compared several model parameters based on the short-range dominance of base correlation. Treating a nucleotide as the genetic information unit, as is usual, we selected a (2,1,1) convolutional code model, and a (3,2,1) model was selected as a transitional case. We compared the code length of the coding output with the code length used for code distance calculation, and confirmed that the (6,3,1), (3,2,1), and (2,1,1) models can all provide good results.

4. The analysis models based on error-correction coding theory were used to study the similarity of DNA sequences. We studied the similarities and dissimilarities among the coding sequences of the first exon of the β-globin gene of 11 species (human, goat, opossum, gallus, lemur, mouse, rabbit, rat, gorilla, bovine, and chimpanzee) with the (6,3,1), (3,2,1), and (2,1,1) convolutional models. We constructed an 8-component vector whose components were the normalized leading eigenvalues of the L/L and M/M matrices. Based on the Euclidean distances between the end points of the 8-component vectors, the simulation shows that the three primates human, chimpanzee, and gorilla are strongly similar to one another because of their evolutionary relationship, whereas opossum (the species most remote from the remaining mammals) and gallus (the only non-mammalian representative) are only weakly similar to the others. The results demonstrate that the approach can reflect important information in the DNA sequences considered.
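The distance comparison in item 4 can be illustrated with a minimal sketch. It assumes each species has already been reduced to an 8-component descriptor vector; how those components are derived from the L/L and M/M matrices is not specified in this abstract and is not reproduced here. The function name `pairwise_euclidean` and the values in `toy_vectors` are hypothetical placeholders chosen for illustration, not data or results from the dissertation.

```python
import math
from itertools import combinations


def pairwise_euclidean(vectors):
    """Return {(name_a, name_b): distance} for every unordered pair of named vectors."""
    lengths = {len(v) for v in vectors.values()}
    if len(lengths) > 1:
        raise ValueError("all descriptor vectors must have the same length")
    return {
        (a, b): math.dist(vectors[a], vectors[b])  # Euclidean distance between end points
        for a, b in combinations(vectors, 2)
    }


# Placeholder 8-component vectors (hypothetical values, NOT taken from the dissertation).
toy_vectors = {
    "species_A": [1.0, 0.5, 0.5, 0.25, 1.0, 0.5, 0.5, 0.25],
    "species_B": [1.0, 0.5, 0.5, 0.25, 1.0, 0.5, 0.5, 0.75],
    "species_C": [0.0, 0.5, 0.5, 0.25, 0.0, 0.5, 0.5, 0.25],
}

distances = pairwise_euclidean(toy_vectors)

# The pair with the smallest distance is read as the most similar pair.
closest = min(distances, key=distances.get)
```

Under this reading, a smaller distance between two vectors indicates more similar sequences, which is how the abstract interprets the primate group versus opossum and gallus.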
Keywords/Search Tags: Error-correction coding, Genetic information, DNA sequence, Degeneracy, Short-range dominance of bases correlation