
Space-efficient Short Read Alignment With Compressed Suffix Array

Posted on: 2015-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: S J Li
Full Text: PDF
GTID: 2308330464964657
Subject: Computer software and theory
Abstract/Summary:
The falling cost and growing accessibility of next-generation sequencing methods have produced large volumes of short reads, which call for fast and accurate read alignment programs. The first generation of hash-table-based methods includes MAQ, which is accurate, feature rich, and fast enough to align short reads from a single individual. However, Bowtie does not support gapped alignment of longer reads, where indels may occur frequently. On the other hand, recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing very long strings, such as the human genome, in main memory, and many alignment methods based on compressed indexes have been developed, for example BWA. In this paper we show how to build a program called CSAA that exploits a CSA index of the reference sequence and performs well in both alignment speed and accuracy. We propose and implement Compressed Suffix Array Alignment (CSAA), a new short read alignment tool based on backward search with a compressed suffix array as the index, to align short reads to a large reference such as the human genome. CSAA uses a search tree over multiple proximate sequences to support mismatches and gapped alignment, and introduces a heap-like structure to reduce the search space of the search tree. Finally, with the help of a penalty strategy and seeding, CSAA achieves accuracy similar to MAQ at a faster speed.
CSAA has three advantages. First, CSAA uses an incremental CSA construction algorithm, which builds the CSA directly without first building the suffix array (SA) and uses less memory than the classic CSA construction algorithm. Second, CSAA uses a seed strategy to speed up alignment, which prunes most invalid search directions while aligning the first few dozen nucleotides of a read. Lastly, because each short read is aligned independently, reads can be aligned in parallel, and CSAA speeds up efficiently by running multiple threads on a multi-core machine. CSAA supports single-end and paired-end mapping, with FASTQ as the input format and SAM (Sequence Alignment/Map) as the output format.
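The backward search mentioned in the abstract can be illustrated with a minimal exact-match sketch. This is not CSAA's implementation: it uses an uncompressed suffix array and the BWT/occurrence-table formulation of backward search as a stand-in for a true compressed suffix array, it handles neither mismatches nor gaps, and the names `build_index` and `backward_search` are hypothetical.

```python
def build_index(reference):
    """Build a naive (uncompressed) index; fine for toy inputs only."""
    text = reference + "$"  # '$' is a sentinel that sorts before A, C, G, T
    # Suffix array by direct sorting of suffixes (quadratic memory, illustration only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (text[-1] is '$' when i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c]: number of characters in text that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(read, sa, C, occ):
    """Return sorted start positions of exact occurrences of read."""
    lo, hi = 0, len(sa)  # half-open interval of matching suffixes
    for ch in reversed(read):  # extend the match one character to the left
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:  # interval became empty: no occurrence
            return []
    return sorted(sa[lo:hi])


sa, C, occ = build_index("ACGTACGTTACG")
print(backward_search("ACG", sa, C, occ))
```

Each step narrows the suffix-array interval using only the current character, so the cost of a lookup depends on the read length rather than the reference length. A search tree for mismatches and gaps, as the abstract describes, would branch over alternative characters at each step instead of following a single one.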
Keywords/Search Tags: short read alignment, DNA sequence alignment, compressed index, compressed suffix array