
Recognition And Prediction On ORF, Intron And Exon For Several Model Genomes

Posted on: 2004-11-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L R Zhang
GTID: 1100360125952798
Subject: Theoretical Physics
Abstract/Summary:
Several model genomes are studied, mainly E. coli, yeast, C. elegans, A. thaliana, D. melanogaster and human. The purpose of the thesis is to find a way to differentiate coding regions from non-coding DNA sequences in these model genomes, to study gene structure and recognize the exons and introns within a gene, and to study the organization of ORFs in some simple organisms. After a brief review of the present status of gene recognition and prediction (Chapter 2) and a concise introduction to the parameter definitions used in this thesis (Chapter 3), the main work is divided into four parts.

In Chapter 4, rules of gene recognition and ORF organization in the Saccharomyces cerevisiae genome are demonstrated by statistical analysis of sequence data. They include:
1) The random frame rule: the six reading frames W1, W2, W3, C1, C2 and C3 on the two strands are randomly occupied by ORFs. Related phenomena of ORF overlapping are also discussed.
2) The first ATG rule: an ORF initiates from the first in-frame ATG codon in the DNA sequence after the nearest upstream terminator (TAA, TAG or TGA). The rule holds with an accuracy of 99.7%.
3) The inhomogeneity rule: coding and non-coding ORFs differ in the inhomogeneity of base composition at the three codon positions. Using the inhomogeneity index (IHI), one can distinguish coding (IHI > 14) from non-coding (IHI ≤ 14) ORFs with 95% accuracy. We find that the "spurious" ORFs (IHI ≤ 14) are distributed mainly in three classes, namely "similarity to unknown proteins", "no similarity" and "questionable ORFs". The total number of spurious ORFs (which are unlikely to be coding ORFs) is estimated to be 470.

In Chapter 5, the general features of nucleotide distribution in exons and introns are studied in detail for the C. elegans genome.
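
The first ATG rule stated for Chapter 4 can be illustrated with a minimal Python sketch. It is not the thesis's code: the function name `first_atg_start` and its arguments are hypothetical, the sequence is assumed to be given 5' to 3' on one strand, and the caller is assumed to know where the ORF's in-frame stop codon begins.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_atg_start(seq, stop_index):
    """Predict the start of the ORF whose stop codon begins at stop_index.

    Hypothetical helper: returns the 0-based index of the first in-frame
    ATG after the nearest upstream in-frame terminator, or None if the
    frame contains no ATG before the stop codon.
    """
    seq = seq.upper()
    # Walk upstream codon by codon until an in-frame terminator is met
    # or the start of the sequence is passed.
    pos = stop_index - 3
    while pos >= 0 and seq[pos:pos + 3] not in STOP_CODONS:
        pos -= 3
    # pos now marks the upstream terminator (or is negative if there is
    # none); scan downstream from the codon that follows it.
    scan = pos + 3
    while scan < stop_index:
        if seq[scan:scan + 3] == "ATG":
            return scan
        scan += 3
    return None
```

For example, in `TAACCCATGAAAATGGGGTAG` with the stop codon at index 18, the sketch picks the ATG at index 6 and ignores the later in-frame ATG at index 12. Only one strand is handled; the reverse strand would need the reverse complement first.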
Based on the heterogeneity of the base distribution of exons at the three codon sites, the inhomogeneity index IHI is further employed to study the sequence characteristics of multi-exon and multi-intron segments. By generalizing the parameter IHI, the interference index R is defined. The character of several sequence segments of exons and introns is studied by use of R, and the relation between the distribution of R and the number of exon and intron segments is obtained.

In Chapter 6, the recognition of ORFs and exons in several representative lower organisms, E. coli, S. cerevisiae and C. elegans, is discussed. Based on the compositional features of coding sequences and the existence of a reading frame with 3-periodicity, a sequence is divided into three subsequences. From the counts of the four bases in each subsequence, the classification of the sequence as exon, intron or intergenic sequence can be predicted for these species. Through a unified approach, namely the introduction of diversity measures and the minimization of the increment of diversity (ID) and the relative increment of diversity (RID), we distinguish exons and ORFs from non-coding regions with a high success rate. The accuracy (the average of sensitivity and specificity) of the prediction reaches about 90% for exons in the C. elegans genome and is generally higher than 95% for ORFs in the S. cerevisiae and E. coli genomes. However, distinguishing introns from intergenic sequences in the C. elegans genome appears more difficult than the other distinctions.

Chapter 7 is the most important part of this thesis, where gene splicing from lower to higher organisms is studied in detail.
Based on the conservation of nucleotides at splice sites and the features of base composition and base correlation around these sites, we use the method of increment of diversity combined with quadratic discriminant analysis (IDQD) to study the dependence structure of splice sites and to predict exons, introns and their boundaries for four model genomes: C. elegans, A. thaliana, D. melanogaster and human. The comparison of compositional features between two sequences and the comparison of base dependencies at adjacent or non-adjacent positions of two...
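
The increment-of-diversity classification described for Chapter 6 can be sketched in Python as follows. The abstract does not give the formulas, so this sketch assumes the commonly used diversity measure D(n_1, ..., n_s) = N ln N - sum_i n_i ln n_i with N = sum_i n_i, and ID(X, Y) = D(X + Y) - D(X) - D(Y). It also assumes that the three subsequences are the bases at positions 0, 1 and 2 modulo 3. All function names are hypothetical, and RID is omitted because its definition is not stated.

```python
import math

BASES = "ACGT"

def position_counts(seq):
    """Count A, C, G, T in each of the three subsequences (12 counts).

    Assumption: subsequence k holds the bases at positions k, k+3, k+6, ...
    """
    counts = [0] * 12
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:  # skip ambiguous symbols such as N
            counts[(i % 3) * 4 + j] += 1
    return counts

def diversity(counts):
    """Assumed diversity measure: N ln N - sum of n_i ln n_i."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(
        n * math.log(n) for n in counts if n > 0
    )

def increment_of_diversity(x, y):
    """Assumed ID: D(x + y) - D(x) - D(y); zero for proportional counts."""
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)

def classify(seq, references):
    """Return the label whose reference counts give the smallest ID.

    references maps a label (e.g. "exon", "intron") to 12 counts
    accumulated from training sequences of that class.
    """
    query = position_counts(seq)
    return min(
        references,
        key=lambda label: increment_of_diversity(query, references[label]),
    )
```

In use, each entry of `references` would be built by summing `position_counts` over the training sequences of one class; a query sequence is then assigned to the class with the smallest increment of diversity.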
Keywords/Search Tags: model organism, genome, structure of gene, exon, intron, open reading frame (ORF), recognition, organization, prediction, random frame, inhomogeneity index (IHI), intergenic sequence, interference index (R), increment of diversity (ID)