Study On Algorithms For Identification Of Repeats In Large-scale Genome

Posted on: 2008-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: F Bai
Full Text: PDF
GTID: 2178360212974585
Subject: Computer software and theory
Abstract/Summary:
Repeat identification is one of the most fundamental problems in genome analysis in modern bioinformatics. Identifying repeats helps reveal their roles in genome evolution and in the inheritance of disease. Genome sequences contain many transposons and retrotransposons that carry coding regions, so identifying these repeats is important for decoding the genome. Although many algorithms have been proposed for this problem, none of them is optimal. To address the flaws of current methods, we present a novel class of repeat-identification algorithms based on seed sequences.

Two methods, RepeatSearcher and GSRSearcher, are proposed in this thesis; both are based on the extension of seed sequences. Using the sequences that contain a seed, RepeatSearcher converts local pairwise alignment into multiple sequence alignment, applying gap penalties within a limited region. The algorithm extends the consensus sequence according to the alignment score and, at the same time, extends every repeat sequence. In this way, the accurate boundaries of the repeat sequences can be confirmed while the consensus sequence is being extended. Multiple alignment largely avoids the imprecision of high-scoring pairs. GSRSearcher inherits the seed-extension approach and makes use of the statistical function of Gibbs sampling. Because it takes the influence of the genomic background into account, the repeat family sequences it identifies are more accurate. Owing to its probabilistic strategy, GSRSearcher converges at a more reasonable speed than RepeatSearcher and can determine the boundaries of repeat sequences exactly.

Finally, RepeatSearcher and GSRSearcher were tested on twelve kinds of mammalian genome sequences, and their output was compared with RepBase and with the output of RECON. The results show that our algorithms perform better than RECON and are effective.
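The seed-extension idea described for RepeatSearcher can be illustrated with a much-simplified sketch. This is not the thesis's implementation: it assumes exact seed matches, allows no gaps, and replaces the alignment score with a plain majority-agreement threshold. The names `find_seed_hits`, `majority_base`, `extend_seed`, and the parameter `min_agree` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def find_seed_hits(genome, seed):
    """Return the start position of every exact occurrence of the seed."""
    hits, i = [], genome.find(seed)
    while i != -1:
        hits.append(i)
        i = genome.find(seed, i + 1)
    return hits


def majority_base(genome, hits, offset, min_agree):
    """Return the majority base of the column at `offset` from each hit,
    or None if a copy runs off the sequence or agreement is too low."""
    positions = [h + offset for h in hits]
    if any(p < 0 or p >= len(genome) for p in positions):
        return None
    base, count = Counter(genome[p] for p in positions).most_common(1)[0]
    return base if count / len(hits) >= min_agree else None


def extend_seed(genome, seed, min_agree=0.75):
    """Grow a consensus outward from the seed, extending all copies together.

    Assumption: ungapped, column-by-column extension that stops when fewer
    than `min_agree` of the copies share the majority base.
    Returns (consensus, [(start, end), ...]) with half-open copy boundaries.
    """
    hits = find_seed_hits(genome, seed)
    if len(hits) < 2:  # a single occurrence is not a repeat
        return seed, [(h, h + len(seed)) for h in hits]
    consensus, left, right = seed, 0, len(seed)
    # Extend to the right of the seed.
    while (b := majority_base(genome, hits, right, min_agree)) is not None:
        consensus += b
        right += 1
    # Extend to the left of the seed.
    while (b := majority_base(genome, hits, -left - 1, min_agree)) is not None:
        consensus = b + consensus
        left += 1
    return consensus, [(h - left, h + right) for h in hits]
```

Calling `extend_seed(genome, seed)` returns the extended consensus together with one boundary pair per copy, which mirrors the idea that the boundaries of every repeat copy are fixed at the same moment the consensus stops growing. A gapped, score-based version would replace `majority_base` with an alignment step.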
Keywords/Search Tags: bioinformatics, repeat identification, consensus sequence, accurate boundaries, seed sequence