Font Size: a A A

Research On Parallel Algorithms For Biological Sequence Analysis On Multi-core CPU And GPU

Posted on:2014-11-11Degree:DoctorType:Dissertation
Country:ChinaCandidate:Z H ZhaoFull Text:PDF
GTID:1228330422973921Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Biological sequence analysis is a traditional research area in bioinformatics and ithas been studied intensively. However, the size and characters of biological sequencedata have been changed as the sequencing technologies developed and new softwares andcomputing platforms are required to analyze the data. Parallel computing has been playedanimportant rolein analyzing large-scalesequence data andsolving complexproblems inbiological sequence analysis. Currently, multi-core CPU and GPU are two popular kindsof parallel processing architectures. On the one hand, multi-core CPUs and GPUs areequipped in desktop computers, workstations and servers widely, on the other hand, theCPU-GPU heterogeneous parallel systems are an important development trend in high-performance computing. Therefore, the research on the parallel algorithms for biologicalsequence analysis on multi-core CPU and GPU has significant meaning.Under this background, we have studied the parallel algorithms on multi-core CPUand GPU from three foundational aspects of biological sequence analysis which are theconstruction of biological sequence index structure, motif discovery, error correction andread alignment of high-throughput sequencing data. Their performance is very importantfor the application of biological sequence analysis. The main work and contribution ofthis thesis are as follows:1. Proposing two parallel algorithms for computing the Burrow-Wheeler Transformon multi-core architectures and on GPU respectively for biological sequence data. TheBurrow-Wheeler Transform plays a key role in indexing biological sequence data. How-ever, computing the Burrow-Wheeler Transform from a long sequence in efficient spaceand time is challenging. The thesis analyzes two existing space-efficient algorithms forbuilding the Burrow-Wheeler Transform and then parallelizes them on multi-core ar-chitectures and on GPU respectively according to their characters. For computing theBurrow-Wheeler Transform from a long sequence, our parallel algorithm on multi-corearchitectures can achieve near linear speed-up with the addition of processing cores andour GPU-based algorithm is about16times faster than the original CPU implementation.When using the two algorithms to compute the Burrow-Wheeler Transform of humanwhole genome, the running time of the former on8cores is comparable to that of thelatter on a NVIDIA GTX580GPU, which is less than4minutes. Furthermore, the two algorithms are also space-efficient.2. Proposing an exact GPU-based algorithm for the motif discovery problem in bi-ological sequences. The motif discovery problem can be formulated as an algorithmicproblem which is called the planted (l,d)-motif problem. Based on the sample-driven ap-proach, we uses full cliques to represent the motifs which is more stringent and accuratethan the cliques of previous algorithms. We implement the full clique search algorithmon GPU. Experiments show that our algorithm is more efficient and scalable in handlinglonger and weaker motifs than the existing algorithms. It can solve the long weak (l,d)-motif problems such as (40,14) within5minutes on a desktop computer equipped with aNVIDIA GTX580display card while previous tools that can solve it need over5hoursand as far as we know, it is the first tool that can solve the (25,10)-and (27,11)-motifproblems. Furthermore, it can achieve near linear speed-up on multi-GPU systems.3. Proposing a parallel algorithm on multi-core computers for short read error cor-rectioninlarge-scalehigh-throughputsequencingdata. High-throughputsequencingplat-forms can produce a large amount of short reads which need much time and space tocompute the index structure for error correction. We improve a previous algorithm whichuses suffix array to index short reads by replacing full suffix array with partial suffix arraywhich is more time and space efficient and parallelize the procedure of building the par-tial suffix array using multithreaded programming. Experiments show that our algorithmis more time-efficient and space-efficient than the original algorithm and it is a scalableparallel algorithm that can works well on multi-core computers using Pthread.4. Proposing a GPU-based algorithm to correct various types of errors in the shortreads produced by different high-throughput sequencing platforms. There are three type-s of errors in high-throughput sequencing data which are substitutions, insertions anddeletions and different types of high-throughput sequencing platforms will produce dif-ferent types of errors. To correct various types of errors in the short reads, we combine analignment-freemethodwithmultiplealignment. Wefirstapplythealignment-freemethodto index the short reads, and then construct multiple alignments of the reads with the samesubstrings to find all types of errors. We accelerate the procedure of constructing multiplealignments, for it is very time-consuming. Experiments show that our algorithm can cor-rect all types of errors in short read data and is more accurate than previous algorithms.Furthermore, with the help of GPU, our algorithm is also comparable or even faster than previous algorithms.5. Proposing a GPU-based algorithm for the k-mismatch mapping of long reads inhigh-throughput sequencing data to reference genomes. The reads produced by high-throughput sequencing tend to become longer, but most of existing mapping algorithmswere designed for short reads. Therefore, the thesis investigates the long read mappingproblem and proposes a GPU-based algorithm for exact k-mismatch mapping. We parti-tion long reads into short segments by the pigeonhole principle and find the exact match-ingofshortsegmentsinreferencegenomesonGPUbasedonBurrow-WheelerTransformwhich can be extended allowing k mismatches. Experiments show that our algorithm cansolve the k-mismatch mapping problem of long reads in high efficiency and it is fasterand more accurate than previous algorithms.
Keywords/Search Tags:Parallel Algorithm, BiologicalSequenceAnalysis, High-throughputSequencing, Burrow-Wheeler Transform, Motif Discovery, Long Read Alignment, Short Read Error Correction, GPU
PDF Full Text Request
Related items