A Cache-oblivious Exact Alignment Algorithm For DNA Short Reads

Posted on:2014-10-15

Degree:Master

Type:Thesis

Country:China

Candidate:R D Li

Full Text:PDF

GTID:2250330422451941

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

With the wide application of the next-generation sequencing technology, themass sequencing data it generates brings formidable challenges for the analysis ofsubsequent sequences. As a pre-processing step of the sequence analysis, thesequence alignment’s performance affects the entire sequence analysis’s efficiency.The sequence alignment can be divided into the following three forms: exactmatching, fuzzy matching within fixing errors and matching of inserting anddeleting. The main subject of this thesis is the accurate matching of mass data and itfocuses on quick positioning of the data on the basis of reducing the comparison.Therefore, in this paper we establish hash indexes of reference genes andsequencing data. The index mainly deals with storing the sequence fragmentstogether which have the same first k bases, so that quick positioning and comparisoncan be achieved.In the mass data sequence alignment the reference genes and sequences stillcontain many to compare under the circumstances that the first k bases are the same.It results in a large number of comparisons and to reduce the comparisons we sortthe data needed to be compared first. Based on the specialty of the data needed to becompared the paper comes up with a cache-oblivious algorithm which can speed upthe comparison. Meanwhile, in order to fully utilize the resources of moderncomputers, we develop a multi-threaded sequence alignment program to acceleratethe matching speed.We compare different methods under different data sizes and find the matchingmethod suitable for its own data size. The result is that cache-oblivious matching forthe small data size, normal quick sorting matching for medium data size, andcache-oblivious quick sorting for mass data. In addition, we compare the programimplemented in this thesis with other matching methods and the conclusion is that ithas more advantages than other programs when dealing with large-scale data.

Keywords/Search Tags:

mass sequence data, exact alignment, cache-oblivious, multi-threadedcomparison

PDF Full Text Request

Related items

1	Study Of Several Algorithms For Alignment Problem Of Sequence And Sequence Secondary Structure
2	Research On Large-scale Biological Sequence Alignment Algorithms And Their Parallelization
3	Research On Sequence Alignment Methods For The Third-generation Sequencing Data
4	Research On Parallel Optimization Method Of Biological Information Sequence Alignment In Multi - Core Environment
5	Gene Sequence Alignment Algorithm Research And Implement In SNP
6	Research On Global Sequence Alignment For Intel Multi-core And Many-core Platforms
7	Design And Analysis Of Quantum Oblivious Transfer Protocols
8	Research On Genomic Sequence Alignment Methods Based On High-throughput Sequencing Data
9	A Sequence Alignment Algorithm With Combining Variants Data
10	Parallel Optimization Design And Implementation Of Biological Sequence Alignment Algorithm