
An Identification Study Of The Five Model Species Genomes

Posted on: 2005-12-05
Degree: Master
Type: Thesis
Country: China
Candidate: C X Chen
Full Text: PDF
GTID: 2120360155476531
Subject: Biophysics
Abstract/Summary:
The complete sequences of five model species genomes are divided into three groups: introns, exons, and intergenic DNA. Based on the conservation of nucleotides around splice sites, the compositional features, and the 3-periodicity of the reading frame in coding sequences, the three kinds of sequences are predicted using the least increment of diversity (LID), with different parameters chosen for the sources of diversity. Analysis of the results identifies some useful information parameters.

In the first part, the least increment of diversity is introduced. For each sequence, three increments of diversity are calculated, one between the measure of diversity D(X) of the sequence and each of the standard measures of diversity D(Xe), D(Xi), and D(Xs) of the different subsets. The sequence is assigned to the class with the minimum increment of diversity. In addition, evaluation indexes are introduced for the prediction process.

In the second part, based on a statistical analysis of the length distribution, 21 trimer probabilities and 12 signals of the three kinds of sequences are selected as parameters. With 21 parameters, the correct prediction rates are 84.26% for the standard set of A. thaliana and 84.64% for the test set; with 12 parameters, they are 81.13% for the standard set and 81.68% for the test set.

In the third part, one third of the sequences are randomly chosen as the standard sets, and the rest form the test sets. Based on the unequal probabilities of the 64 trimers, and after analysis of the length distribution, the probabilities of 64, 40, and 20 trimers of the three kinds of sequences are selected in turn as parameters of the standard sources of diversity. The overall prediction accuracies are 82.19% and 87.95% on the standard and test sets for the A. thaliana genome with 64 trimers, and 79.67% and 81.93% on the standard and test sets for the C. elegans genome, respectively.

In the fourth part, the conservation of nucleotides around splice sites, the compositional features, and the existence of a reading frame with 3-periodicity in coding sequences are analyzed. The three standard sources of diversity are determined by the probabilities of the 64 trimers over the whole sequence and of the 4 bases at 30 positions around the splice sites. The results show that, on all sets, the 184 information parameters (64 trimer probabilities plus 4 × 30 = 120 splice-site base probabilities) give higher rates of correct prediction than the 64 trimer signals alone. The prediction accuracies are 87.37%, 90.72%, 91.08%, 92.28%, and 92.88% for the C. elegans (C), S. cerevisiae (S), A. thaliana (A), D. melanogaster (D), and E. coli (E) genomes, respectively.

These results show that the least increment of diversity is a useful method for identifying gene structure. Moreover, the choice of information parameters in the source of diversity is important throughout the prediction process.
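
The classification rule described in the first part can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so this assumes the definitions commonly used in the measure-of-diversity literature: D(X) = N ln N - sum(n_i ln n_i) for counts n_i with total N, and increment of diversity ID(X, S) = D(X + S) - D(X) - D(S). The function names, the use of overlapping trimer counts as the only parameters, and the toy sequences are all hypothetical choices for illustration, not the thesis's actual parameter sets or data.

```python
import math
from collections import Counter
from itertools import product

# All 64 trimers over the DNA alphabet, in a fixed order.
TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]


def trimer_counts(seq):
    """Count overlapping trimers in seq (assumption: overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    return [counts.get(t, 0) for t in TRIMERS]


def diversity(counts):
    """Assumed measure of diversity: N ln N - sum(n_i ln n_i), with 0 ln 0 = 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, s):
    """Assumed increment of diversity: D(X + S) - D(X) - D(S)."""
    merged = [a + b for a, b in zip(x, s)]
    return diversity(merged) - diversity(x) - diversity(s)


def build_standard(seqs):
    """Sum trimer counts over a set of training sequences of one class."""
    total = [0] * len(TRIMERS)
    for seq in seqs:
        total = [a + b for a, b in zip(total, trimer_counts(seq))]
    return total


def classify(seq, standards):
    """Assign seq to the class whose standard source gives the least increment."""
    x = trimer_counts(seq)
    return min(standards, key=lambda label: increment_of_diversity(x, standards[label]))


# Toy standard sets (hypothetical sequences, for illustration only).
standards = {
    "exon": build_standard(["ATGGCCATGGCCATGGCC"]),
    "intron": build_standard(["GTAAGTTTTTTTTTTTAG"]),
    "intergenic": build_standard(["ATATATATATATATATAT"]),
}
label = classify("ATGGCCATGGCC", standards)
```

Under these assumed definitions, the increment is zero when two count vectors are proportional and grows as their compositions diverge, which is why the minimum over the three standard sources serves as the class decision. Adding position-specific base counts around splice sites would simply extend the count vector beyond the 64 trimer entries.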
Keywords/Search Tags: measure of diversity, least increment of diversity, model species DNA sequences, exon, intron, intergenic DNA, splice site, prediction