Font Size: a A A

Research On BWT Index And Algorithms In Biological Sequence Alignment

Posted on:2016-10-08Degree:MasterType:Thesis
Country:ChinaCandidate:Y N ZhaoFull Text:PDF
GTID:2180330470957842Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
Biological DNA sequence alignment is the core issue in bioinformatics and is one of the basic researches. The sequence alignment algorithm is a common tool in sequence analysis, which is also the first step in RNA alignment and personalized medicine and many other downstream biological researches. The rapid development of new generation of high-throughput sequencing technology results in a large-scale sequencing reads. How to achieve high accuracy and speed in sequencing mapping is the challenge faced by bioinformatics. Index as an important technology to process large-scale genomes and reads has been widely used by the vast majority of the current algorithms, which becomes a key point in mapping researches. This paper aims at the widely used indexing technology BWT(Burrows-Wheeler Transform) and sequence alignment algorithms in all-mapper mode, the main content and contributions include:(1) Research on second-order BWT with double-character indexFirstly, we analysis the traditional BWT technology which searches text each bit at a time. Aiming at its too much memory access, large amount of calculation and time-consuming searching process when facing large-scale data, we proposes a new second-order BWT index concept as well as its implementation. Based on the traditional BWT’s auxiliary data structures, our algorithm can find two characters at one time which can reduce the frequency of loop and calculation in sequence alignment algorithm. It can also reduce the mapping algorithm complexity by half and improve the search efficiency. The time complexity of algorithm analysis and experimental results show that, compared to traditional BWT index which we call first-order, the second-order BWT index’s performance improves around35%. In order to further reduce the suffix array’s transform time in BWT, we also design a coding method to speed up the matching of double-character index. Experimental results show that the improved algorithm can get an extra performance boost of about10%.(2) Research on mixed fast indexing technology with combination of BWT and Hash and sequence alignment algorithmAccording to the differences of pattern finding and application scenarios, sequence alignment problem can be divided into two categories:all-mappers and best-mappers. The all-mappers report all the positions that are satisfied the finding condition while the best-mappers only report the top k positions that are satisfied the conditions. In this paper, we do researches on all-mappers. With the combination of BWT index and Hash index algorithms’advantages, we propose a new hybrid fast indexing technology which achieves a better balance of time complexity and space complexity in index algorithm. Based on the structure of the hybrid indexing and combining the seed division method based on the pigeonhole principle, we thus give a whole new all-mapper algorithm. Experimental results show that, compared to the existing sequence alignment algorithms, our algorithm can improve running time by30%as well as keeping the same recall ratio with the space only increased by3%. To further improve running speed, we design a transform algorithm which read the SA array block at one time to improve the cache efficiency. The method overcome the shortcomings in BWT transforming stage such as excessive jumping and low cache hit rate. Experimental results show that the improved algorithm can get an extra performance boost of about10%.
Keywords/Search Tags:bioinformatics, sequence alignment, index, BWT technology, Hashindex
PDF Full Text Request
Related items