
The Research Of Gene-finding Algorithms Based On Statistics

Posted on: 2008-08-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Fang
GTID: 2178360212995891
Subject: Computer application technology
Abstract/Summary:
Driven by advances in biological science and technology, bioinformatics is a fast-growing discipline at the intersection of biology, applied mathematics, and computer science. Emerging in the 1950s and expanding rapidly in the 1990s, it has become an important frontier of the life sciences and one of the core fields of natural science in the 21st century. Using computers and computer networks together with the theory, methods, and technology of mathematics and information science, bioinformatics studies biological macromolecules, focusing on nucleic acids and proteins: their sequences, structures, and functions.
An important task in nucleic acid sequence analysis is to process the large number of DNA sequences that have already been determined but remain unannotated and of unknown function, partitioning each sequence into regions such as genes, promoters, and regulatory regions, and annotating it accordingly. The first major task in annotating a sequence is gene identification, i.e., finding all the genes in it. Methods for gene identification in DNA sequences are therefore an important research topic in bioinformatics. A key issue in gene identification is the prediction of coding regions, meaning the prediction of coding DNA sequences and, for eukaryotic genes, of exons. The ultimate goal of gene identification is to predict the complete gene structure, identifying all exons of a gene and their boundaries, so as to provide reliable sequence data for subsequent functional annotation, biological experiments, and ultimately the further development of genomics.
As the human genome projects enter the large-scale sequencing phase, computer programs are becoming essential for identifying protein-coding genes efficiently and reliably in large uncharacterized genomic sequences, typically tens of thousands or even hundreds of thousands of nucleotides long. At the core of all gene identification programs there are one or more coding measures.
Some programs rely mainly on additional information, such as potential sequence signals and sequence-similarity database searches, while others are based on coding statistics. A coding statistic can be defined as a function that, given a DNA sequence, computes a real number related to the likelihood that the sequence codes for a protein.
Our classification of coding measures is, however, slightly different. The main distinction is between measures that depend on a model of coding DNA and measures that are independent of such a model. The model of coding DNA is always probabilistic: it allows the probability of a DNA sequence to be computed given that the sequence is coding. For model-dependent coding statistics, the value (score) of the statistic on a query sequence is computed from this probability. Model-dependent coding statistics are likely to capture more of the specific features of coding DNA and may therefore be more powerful in discriminating coding from non-coding DNA. They require, however, a representative sample of coding DNA from the species under consideration from which to estimate the model's parameters (probabilities). Model-independent coding statistics, on the other hand, capture only the universal features of coding DNA; since they do not require a sample of coding DNA, they can be used even when no coding regions are previously known for the species under consideration.
Here we introduce several typical statistics-based coding measures: measures based on codon usage bias (Codon Usage), on dependence between nucleotide positions (HMM), on base compositional bias between codon positions (Position Asymmetry), and on periodicity in base occurrence (Periodic Asymmetry Index). On this basis a combined algorithm is proposed, which aims to combine species-specific information with the general statistical characteristics of coding DNA.
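To make the distinction between the two kinds of coding statistic concrete, here is a minimal Python sketch; it is not the thesis implementation (which the abstract says was written in C). It assumes one common formulation of each measure: a codon usage score computed as the average log-likelihood ratio of the observed codons against a uniform model (model-dependent, since the codon frequencies must be estimated from known coding DNA), and a position asymmetry score computed as the summed squared deviation of each base's frequency across the three codon positions (model-independent). The function names and the pseudocount are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGT"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_SET = set(CODONS)


def codons_in_frame(seq, frame=0):
    """Split a sequence into complete, unambiguous codons in the given frame."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return [c for c in codons if c in CODON_SET]


def train_codon_usage(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from known coding sequences (assumed in frame 0).

    The pseudocount (an assumption of this sketch) keeps unseen codons
    from getting probability zero.
    """
    counts = Counter()
    for s in coding_seqs:
        counts.update(codons_in_frame(s))
    total = sum(counts.values()) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def codon_usage_score(seq, freqs, frame=0):
    """Model-dependent statistic: mean log-likelihood ratio per codon
    of the trained codon model against a uniform (1/64) model."""
    codons = codons_in_frame(seq, frame)
    if not codons:
        return 0.0
    return sum(math.log(freqs[c] * 64.0) for c in codons) / len(codons)


def position_asymmetry(seq):
    """Model-independent statistic: for each base, the squared deviation of
    its frequency at codon positions 1, 2, 3 from its mean over the three
    positions, summed over the four bases."""
    seq = seq.upper()
    per_position = [Counter(seq[i::3]) for i in range(3)]
    score = 0.0
    for base in BASES:
        f = []
        for counts in per_position:
            n = sum(counts[b] for b in BASES)
            f.append(counts[base] / n if n else 0.0)
        mu = sum(f) / 3.0
        score += sum((x - mu) ** 2 for x in f)
    return score
```

A positive codon usage score means the query uses codons that are over-represented in the training sample; the position asymmetry score needs no training sample at all, which is why it remains usable for a species with no previously known genes.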
The combined algorithm forms a weighted combination of model-dependent and model-independent coding measures, adjusts the weight factors according to statistical data from the specific genome, and thereby decides how much each of the two kinds of computation contributes, so that sequence information is used fully and reasonably to obtain good identification results. The combined algorithm was implemented in C, and its validity was verified on experimental data.
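The abstract does not state how the weight factors are derived from the genome statistics, so the following Python sketch only illustrates the general idea of a weighted combination. The weighting rule (trust the model-dependent score more as the training sample of coding DNA grows), the parameter `half_weight_at`, and the function names are all assumptions made for illustration, not the method of the thesis. The two input scores are assumed to have been rescaled to a comparable range beforehand.

```python
def weight_from_training_size(n_training_codons, half_weight_at=1000.0):
    """Hypothetical rule: weight of the model-dependent score grows from 0
    towards 1 with the number of training codons available for the species.
    It equals 0.5 when n_training_codons == half_weight_at."""
    if n_training_codons < 0:
        raise ValueError("n_training_codons must be non-negative")
    return n_training_codons / (n_training_codons + half_weight_at)


def combined_score(model_dependent, model_independent, w):
    """Convex combination of the two (pre-normalized) coding scores."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * model_dependent + (1.0 - w) * model_independent


def is_coding(model_dependent, model_independent, n_training_codons,
              threshold=0.5):
    """Call a window coding when the combined score reaches the threshold
    (the threshold value is also an illustrative assumption)."""
    w = weight_from_training_size(n_training_codons)
    return combined_score(model_dependent, model_independent, w) >= threshold
```

With no training data the weight is 0 and the decision rests entirely on the model-independent measure; as more coding DNA from the species becomes available, the species-specific measure gradually dominates.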
Keywords/Search Tags: Gene-finding