Font Size: a A A

Design And Optimization Of High-Performance Algorithms For Processing Biological Sequence Data

Posted on:2020-07-15Degree:MasterType:Thesis
Country:ChinaCandidate:K C FanFull Text:PDF
GTID:2370330572988976Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Massively parallel sequencing(also called Next Generation Sequencing)is capable of generating massive sequences of nucleotide with lower cost and amazing throughput capacity,however the reads is much shorter.The increasing of throughput capacity and the decreasing of reads length make massively parallel sequencing less accurate than traditional sequencing technology.And they also make the mapping between read and reference meet a huge challenge,which result in the computing time longer transforming the data to usable information.In addition,massively data of reference and reads generated by NGS dwarfs the resources of memory.Because of the huge scale of data,variety of index technologies have been applied in the short reads mapping.They can't exploit the limited memory efficiently,and the occupancy rate is still in a high level.A novel index data structure called succinct(or sparse)hash index is presented in this paper for this problem.It is a variant of Q-gram index and make people can change the occupancy rate by setting parameter.Such as the memory occupation of the reference of human genome can reduce to 1/k.We also implement an efficient and parallel algorithm for constructing the index.What's more,as we know the mapping time accounts for a large proportion to the total time of analyzing gene data.To solve the problems that the speed and accuracy of read mapping are decreasing,we present two algorithms of seed selection-Group Seeding and Variable-length Seeding,based on succinct hash index.These algorithms are used in filter strategy to reduce the frequency of validation for improving the speed of mapping.Also,we design a fast and efficient read mapper—FEM(Fast and Efficient read Mapper)on the basis of the above.FEM will return all locations which short reads are mapped in the reference within a given threshold of edit distance.To speed up the process of mapping,multithreading is applied in FEM,and a load balance scheme is tailored to make full use of the resource of computing.We also optimize the process in term of computer architecture by implementing the algorithm with SIMD instructions in details.The result of our experiment shows that FEM is scalable.And in terms of speed and memory occupation,it outperforms other state-of-the-art all mappers.The occupancy rate of memory using succinct hash index is 1/k of using the typical hash index.And about mapping speed,FEM is 5 times faster than Massai when they both use single thread.If FEM used multi-thread,it can be two orders-of-magnitude faster.Besides,FEM is about 3 times faster than BitMapper.And FEM has an order-of-the-magnitude speedup compared to both BitMapper2 and Hobbes3.
Keywords/Search Tags:Next Generation Sequencing, Short Read Mapping, Succinct Hash Index, Group Seeding, Variable-length Seeding
PDF Full Text Request
Related items