Design And Optimization Of High-Performance Algorithms For Processing Biological Sequence Data

Posted on:2020-07-15

Degree:Master

Type:Thesis

Country:China

Candidate:K C Fan

Full Text:PDF

GTID:2370330572988976

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Massively parallel sequencing(also called Next Generation Sequencing)is capable of generating massive sequences of nucleotide with lower cost and amazing throughput capacity,however the reads is much shorter.The increasing of throughput capacity and the decreasing of reads length make massively parallel sequencing less accurate than traditional sequencing technology.And they also make the mapping between read and reference meet a huge challenge,which result in the computing time longer transforming the data to usable information.In addition,massively data of reference and reads generated by NGS dwarfs the resources of memory.Because of the huge scale of data,variety of index technologies have been applied in the short reads mapping.They can't exploit the limited memory efficiently,and the occupancy rate is still in a high level.A novel index data structure called succinct(or sparse)hash index is presented in this paper for this problem.It is a variant of Q-gram index and make people can change the occupancy rate by setting parameter.Such as the memory occupation of the reference of human genome can reduce to 1/k.We also implement an efficient and parallel algorithm for constructing the index.What's more,as we know the mapping time accounts for a large proportion to the total time of analyzing gene data.To solve the problems that the speed and accuracy of read mapping are decreasing,we present two algorithms of seed selection-Group Seeding and Variable-length Seeding,based on succinct hash index.These algorithms are used in filter strategy to reduce the frequency of validation for improving the speed of mapping.Also,we design a fast and efficient read mapper�FEM(Fast and Efficient read Mapper)on the basis of the above.FEM will return all locations which short reads are mapped in the reference within a given threshold of edit distance.To speed up the process of mapping,multithreading is applied in FEM,and a load balance scheme is tailored to make full use of the resource of computing.We also optimize the process in term of computer architecture by implementing the algorithm with SIMD instructions in details.The result of our experiment shows that FEM is scalable.And in terms of speed and memory occupation,it outperforms other state-of-the-art all mappers.The occupancy rate of memory using succinct hash index is 1/k of using the typical hash index.And about mapping speed,FEM is 5 times faster than Massai when they both use single thread.If FEM used multi-thread,it can be two orders-of-magnitude faster.Besides,FEM is about 3 times faster than BitMapper.And FEM has an order-of-the-magnitude speedup compared to both BitMapper2 and Hobbes3.

Keywords/Search Tags:

Next Generation Sequencing, Short Read Mapping, Succinct Hash Index, Group Seeding, Variable-length Seeding

PDF Full Text Request

Related items

1	Design And Implementation Of A Parallel DNA Sequences Mapping System Based On MPI
2	The Study On Read Alignment Algorithm For High-throughput Sequencing Datasets
3	Research And Optimization On All-Mapping Read Aligner Based On Variable-Length Seeds
4	Researches Of Short Sequence Alignment And Scaffold Algorithm Based On Next Generation Sequencing
5	Researches On Long Read Alignment Algorithms Oriented To The Third Generation Sequencing Technology
6	Research On Salt-Seeding Numerical Simulations In Convective Cloud
7	Study On Effects Of Different Reed Rhizome Length Transplanting Conditions On Seeding And Growth
8	Optimizing High-throughput Biological Gene Sequencing Data Processing Algorithms Based On Hash
9	Numerical Simulation Study On Seeding Potential Of Stratiform Cloud Systems
10	Sowing Rainfall Of Convective Clouds Hail Three-dimensional Numerical Simulation