Font Size: a A A

Research On Alignment Algorithm For High Throughput Sequencing

Posted on:2019-07-27Degree:MasterType:Thesis
Country:ChinaCandidate:R TaoFull Text:PDF
GTID:2428330596450496Subject:Engineering
Abstract/Summary:PDF Full Text Request
In the past few years,high-throughput sequencing has greatly expanded the field of application of sequencing technology and produced a large number of sequencing data sets.It is a prerequisite and crucial step for many biomedical research to quickly align these massive sequencing data to the genome and accurately find their original location.Therefore,many sequence alignment tools have been developed specifically for short sequence alignments.However,with the continuous development of high-throughput sequencing technology,the length of the generated sequence has increased from 36 bp to 100-150 bp,which result in that more space and longer time are needed for previous alignment tools to store the candidate solution and perform the search-and-replace function when using the backtracking algorithm to achieve fuzzy matching.Therefore,for the 100-150 bp sequences,some recent alignment tools search for their correct positions on the genome via generating and aligning short seed sequences instead of the complete one.And,obviously,the shorter the length of seeds is,the larger the number of candidate positions on the reference genome is,which leads to that the current sequence alignment tools either take several times as much space as the reference genome size to store the data or cost a considerable amount of time to search repeatedly.In order to better meet the new requirements of high-throughput sequencing alignment,we introduced hash index into the BWT index structure and proposed a novel alignment algorithm based on an improved index structure,which makes a trade-off between time and space.The idiographic work is as follows:(1)In this paper,we first studied and analyzed the partial sequence alignment tools for 100 ~ 150 bp length and found the reason for the greater memory consumption is due to that the established hash index needs to store all candidate positions on the reference genome.Considering the low memory requirement of BWT index structure,we proposed to combine hash index with BWT index structre.Firstly,we establish a BWT index for the reference genome and then use the BWT index to calculate the auxiliary data sp and ep values for all 13-bp seed sequences generated by traversal.With sp and ep values,we calculate all candidate positions using FM-Index whenever needed,and only two values can be used instead of all candidate positions when storing,which greatly reduces the space consumption.(2)In the analysis process of traditional FM-Index,we found that a sampling storage method is used to limit the space consumption,which leads to more memory accesses and larger amount of computation when searching and locating with sp and ep values.And especially when the scale of candidate locations is larger,the problem of time-consuming is more prominent.In order to optimize this process,we designed a tree-structured and breadth-first search algorithm which greatly reduces the time consumption.(3)According to different needs of the downstream analysis,the alignment mode can be divided into two classes: the full alignment mode and the best alignment mode.This article focuses on the best alignment mode.In the candidate screening phase,we propose a new screening method that is different from the seed expansion strategy.This method varies mainly in the following two ways: 1)The seed length is shorter,the seed counts increase,and only the exact matching is done;2)Combined with the optimal path coverage algorithm,the position with the highest score in the seed position set after clustering was selected.Through experiments,we first verified the effectiveness of the improved index structure.In addition,we evaluated the proposed optimization algorithm by comparing with some existing mainstream alignment algorithms on simulated data and real sequencing data at the same experimental condition.The results show that our method can better meet the needs of medium-long sequence alignments with less space requirements and higher time efficiency.
Keywords/Search Tags:High-throughput sequencing, Sequence alignment, Alignment algorithm
PDF Full Text Request
Related items