Font Size: a A A

A Cache-oblivious Exact Alignment Algorithm For DNA Short Reads

Posted on:2014-10-15Degree:MasterType:Thesis
Country:ChinaCandidate:R D LiFull Text:PDF
GTID:2250330422451941Subject:Computer technology
Abstract/Summary:PDF Full Text Request
With the wide application of the next-generation sequencing technology, themass sequencing data it generates brings formidable challenges for the analysis ofsubsequent sequences. As a pre-processing step of the sequence analysis, thesequence alignment’s performance affects the entire sequence analysis’s efficiency.The sequence alignment can be divided into the following three forms: exactmatching, fuzzy matching within fixing errors and matching of inserting anddeleting. The main subject of this thesis is the accurate matching of mass data and itfocuses on quick positioning of the data on the basis of reducing the comparison.Therefore, in this paper we establish hash indexes of reference genes andsequencing data. The index mainly deals with storing the sequence fragmentstogether which have the same first k bases, so that quick positioning and comparisoncan be achieved.In the mass data sequence alignment the reference genes and sequences stillcontain many to compare under the circumstances that the first k bases are the same.It results in a large number of comparisons and to reduce the comparisons we sortthe data needed to be compared first. Based on the specialty of the data needed to becompared the paper comes up with a cache-oblivious algorithm which can speed upthe comparison. Meanwhile, in order to fully utilize the resources of moderncomputers, we develop a multi-threaded sequence alignment program to acceleratethe matching speed.We compare different methods under different data sizes and find the matchingmethod suitable for its own data size. The result is that cache-oblivious matching forthe small data size, normal quick sorting matching for medium data size, andcache-oblivious quick sorting for mass data. In addition, we compare the programimplemented in this thesis with other matching methods and the conclusion is that ithas more advantages than other programs when dealing with large-scale data.
Keywords/Search Tags:mass sequence data, exact alignment, cache-oblivious, multi-threadedcomparison
PDF Full Text Request
Related items