Vocabularies Mining Based On The Genomes DNA Sequence

Posted on: 2012-08-12  Degree: Master  Type: Thesis
Country: China  Candidate: J Xu  Full Text: PDF
GTID: 2218330362451434  Subject: Computer Science and Technology
Abstract/Summary:
As more organisms have their genomes sequenced, the first problem is understanding the exact biological mechanisms encoded in those genomes. There is a close association between natural language and the biological genome, which encodes a large amount of genetic information, and vocabulary-based genome analysis is a new and challenging research direction.

In this thesis, we treat the whole Arabidopsis genome as a text composed of words. We try to segment the genome sequence into words, which lays a foundation for further analysis of DNA sequences. First, we analyze the whole genome and present a language-independent algorithm for classifying real words and pseudo words. Then we analyze different regions, segmenting the sequences of each region separately into words by maximum probability.

For the promoter region, we compared the putative vocabulary against a list of known transcription factor binding sites (TFBS) in Arabidopsis; 78% of the known TFBS instances were found in the set of putative words.

After analyzing six regions of the Arabidopsis genome, we used the results for each region to count the pyknons located in it. We found more pyknon instances in noncoding regions than in coding regions: the intergenic region contained the most pyknons, followed by promoters, introns, the coding region, the 3'UTR, and the 5'UTR. The pyknon trends in the human genome are the same as those in the Arabidopsis genome.
Keywords/Search Tags: whole genome, biological mechanisms, classification algorithm of real words and pseudo words, maximum probability segmentation
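
The abstract mentions segmenting sequences into words with maximum probability but does not give the procedure. As an illustration only, the following is a minimal sketch of one common way to do this: a unigram model with dynamic programming over split points. The function name `segment_max_prob`, the `max_len` cutoff, and the toy vocabulary with its probabilities are all assumptions made for this example and are not taken from the thesis.

```python
import math

def segment_max_prob(seq, word_probs, max_len=8):
    """Return (words, log_prob) for the most probable segmentation of seq.

    Assumes a unigram model: the probability of a segmentation is the
    product of the probabilities of its words (hypothetical model choice).
    """
    n = len(seq)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of seq[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in seq[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_probs.get(seq[start:end])
            if p is None or best[start] == -math.inf:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("sequence cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(seq[back[pos]:pos])
        pos = back[pos]
    words.reverse()
    return words, best[n]

# Toy vocabulary with made-up probabilities (not from the thesis).
toy_vocab = {"TATA": 0.3, "AAT": 0.2, "CG": 0.1,
             "A": 0.1, "C": 0.1, "G": 0.1, "T": 0.1}

words, log_prob = segment_max_prob("TATAAATCG", toy_vocab)
print(words)  # ['TATA', 'AAT', 'CG']
```

Working in log-probabilities avoids numerical underflow on long sequences, and single-nucleotide entries in the vocabulary guarantee that every sequence over A, C, G, T has at least one valid segmentation.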