| The booming development of next-generation sequencing has been advancing the technology in sequence alignment,and the burden from the enormous amount of reads is becoming heavier.As one of primary steps of DNA sequence analysis,the approximate matching is the key area of bioinformatics.As for k-mismatch alignment,state-of-the-art aligners based on Burrows-Wheeler transform(BWT)use various heuristics to improve the speed of alignment,but there is a trade-off betweent the efficiency and the accuracy of alignment.Some of them could not handle a large amount of mismatches,neither is the high sensitivity guaranteed.With the in-depth study of BWT,the novel spaced Burrows-Wheeler transform(SBWT)was proposed.The results include:Similary to BWT,SBWT is reversible and its LF Mapping also applies to exact mapping.The spaced suffix array based on SBWT is the suffix-tree variant,and BWT can be constructed by 1-SBWT where its period is 1.However,there exist some differences between them.First,The number of $ to be appended to the end of the sequence is related to both the period and the length of reference sequence,while there is only one $ in BWT.Second,the spaced multi-key fast sort algorithm is implemented to transform the reference periodically.Finally,the transformed string is retained from the ? column of suffix matrix.The repeative sequence distribution in human genome was investigated,and we found that the duplication ratio of consective seeds amost is 1:7,and the frequency distribution had one long tail,but approximately only one of twenty one spaced seeds had mirror sites in the reference.Spaced seeds in SBWT,compared with consecutive seeds,increase filtration specificity while preserving high sensitivity.The second index is designed to power the verification of alignment algorithm.As the last step of alignment,the candidates are mapped to the read and only those whose Hamming distance with the read meets the criterion are reserved.Sequence encoding and the second index built from homogeneous spaced suffix region both amortize the cost of verification when dealing with the repetitiveness of genome.Combined together,these methods significantly speed up k-mismatch searching.Although heavy memory usage is put into service,SBWT powered by spaced seeds and other auxiliart data structure is ~5-21 faster than other BWT-based aligners. |