
The Research Of Gene Identification Algorithm Based On Spectrum Analysis

Posted on: 2015-11-25
Degree: Master
Type: Thesis
Country: China
Candidate: A M Chen
GTID: 2180330422982407
Subject: Computational Mathematics
Abstract/Summary:
With the interdisciplinary development of bioinformatics, computer technology, mathematics, physics and other fields, genetic research has gradually entered the post-genome era. Faced with a rapidly growing volume of genetic data, understanding these data promptly and effectively, and mining biologically significant knowledge from them, has become an important goal of gene recognition. The most direct way to reach this goal is to find a fast, efficient and accurate algorithm for the gene recognition problem.

Since the discovery of the 3-periodicity feature, which is observed only in coding sequences, spectral analysis has attracted considerable attention from researchers in genetics, and many gene identification algorithms based on spectral analysis have emerged. Currently, the Voss and Z_curve algorithms are the two most commonly used gene identification methods, and each has its own advantages and disadvantages. First, by examining how the Voss and Z_curve algorithms compute the power spectrum and the SNR (signal-to-noise ratio) of a DNA sequence, we derive the relationship between the SNR and power spectrum values produced by the two algorithms. Building on this relationship, we propose the Quadratic-form frequency fast algorithm (QF3). The QF3 algorithm not only guarantees that its output agrees with the true value, but also avoids computing the DFT. The superiority of the QF3 algorithm is then verified on standard gene data obtained from the EMBL gene database. The experiments also show that the running time of the QF3 algorithm is largely insensitive to the length of the sequence.

Based on the SNR, we determine the SNR threshold by applying both the bootstrap sampling algorithm and the SVM classification algorithm.
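The period-3 SNR discussed above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not state the SNR formula, so the code assumes the commonly used definition, namely the total power of the four Voss indicator sequences at frequency index N/3 divided by the mean of the power spectrum. The function names (`voss_indicators`, `snr_period3`, `snr_period3_direct`) are hypothetical. The second function evaluates the same quantity without any DFT, using base counts at the three codon positions and Parseval's identity; whether this corresponds to the QF3 algorithm is not stated in the abstract.

```python
import numpy as np

def voss_indicators(seq):
    """Voss mapping: one 0/1 indicator sequence per base."""
    seq = seq.upper()
    return {b: np.array([1.0 if c == b else 0.0 for c in seq]) for b in "ACGT"}

def power_spectrum(seq):
    """Total power spectrum P[k] = sum over bases of |DFT(u_b)[k]|^2."""
    return sum(np.abs(np.fft.fft(u)) ** 2 for u in voss_indicators(seq).values())

def snr_period3(seq):
    """Assumed definition: SNR = P[N/3] / mean(P), with N a multiple of 3."""
    n = len(seq) - len(seq) % 3      # trim so that N/3 is an integer index
    p = power_spectrum(seq[:n])
    return p[n // 3] / p.mean()

def snr_period3_direct(seq):
    """Same SNR without a DFT (illustrative, not claimed to be QF3).

    For base b with counts x0, x1, x2 at positions n % 3 == 0, 1, 2:
      |x0 + x1*w + x2*w^2|^2 = x0^2 + x1^2 + x2^2 - x0*x1 - x1*x2 - x0*x2,
    where w = exp(-2*pi*i/3). By Parseval, mean(P) equals the number of
    A/C/G/T symbols in the (trimmed) sequence.
    """
    seq = seq.upper()
    seq = seq[: len(seq) - len(seq) % 3]
    p3, total = 0.0, 0
    for b in "ACGT":
        x0, x1, x2 = (seq[r::3].count(b) for r in range(3))
        p3 += x0 * x0 + x1 * x1 + x2 * x2 - x0 * x1 - x1 * x2 - x0 * x2
        total += x0 + x1 + x2
    return p3 / total
```

Both functions expect a sequence with at least three A/C/G/T symbols. The direct version needs only one pass of counting per base, which illustrates why a DFT-free evaluation of the N/3 component can be cheap.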
After comparing the two algorithms experimentally on the problem of determining the SNR threshold, we conclude that the SVM algorithm has higher classification accuracy. In particular, with small samples the SVM algorithm performs much better at determining the SNR threshold.

Identifying and locating the coding regions in DNA sequences is of key importance for gene identification. So far, the two most widely used approaches are the fixed-length sliding window method, which produces a spectral curve, and the moving sequence identification method, which produces an SNR curve. However, closer analysis shows that the positioning accuracy of both is relatively coarse. To improve the positioning accuracy, we first obtain the approximate range of the coding regions with these two algorithms and then propose a new method, the FWSMC algorithm, which uses the biology tool sequence viewer to adjust the interval endpoints. The FWSMC algorithm not only improves the positioning accuracy, but also follows a rigorous experimental procedure and yields results that are easy to inspect visually.

The last chapter of the thesis gives an integrated application of the new algorithms proposed in this work: a comprehensive simulation experiment on five given DNA sequences. The experimental results show that the gene coding region of each sequence can be successfully identified and accurately located. The experiment also determines the species of each sequence.
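The fixed-length sliding window idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the FWSMC algorithm or the thesis's code: the function name `period3_curve`, the default window length of 351 and the step of 3 are hypothetical choices, and the curve value is assumed to be the total Voss power at the period-3 frequency within each window.

```python
import numpy as np

def period3_curve(seq, window=351, step=3):
    """Period-3 spectral curve from a fixed-length sliding window.

    Returns (starts, curve), where curve[i] is the total Voss power at
    frequency 1/3 for the window beginning at starts[i].
    """
    if window % 3 != 0:
        raise ValueError("window must be a multiple of 3")
    seq = seq.upper()
    # 4 x N matrix of Voss indicator sequences (rows: A, C, G, T).
    ind = np.array([[c == b for c in seq] for b in "ACGT"], dtype=float)
    # Complex exponential at the period-3 frequency, one value per window position.
    phase = np.exp(-2j * np.pi * np.arange(window) / 3)
    starts = np.arange(0, len(seq) - window + 1, step)
    curve = np.array(
        [np.sum(np.abs(ind[:, s:s + window] @ phase) ** 2) for s in starts]
    )
    return starts, curve
```

Windows whose curve value stands well above the background are candidate coding regions. Because each value summarizes a whole window, the region boundaries are only located to within roughly the window length, which is consistent with the coarse positioning accuracy noted above.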
Keywords/Search Tags: gene recognition, QF3 algorithm, bootstrap sampling algorithm, SVM, FWSMC method