Font Size: a A A

High Performance Algorithms And Models For Large-scale Biological Sequence Analysis

Posted on:2015-01-23Degree:DoctorType:Dissertation
Country:ChinaCandidate:J Y YangFull Text:PDF
GTID:1260330428984431Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the development of new DNA sequencing technologies, an enormous amount of sequence data has been generated, which presents great challenges for computational biology problems. Therefore, how to deal with the rapid growth of data volumes be-comes a problem needing to be solved urgently. In order to handle the large scale datasets, we study the methods from three different levels, include data organization, algorithm development and parallelization. Data organization is to establish models for data presentation and organization, which contains the global information and could improve the efficiency of data analysis; Algorithm development is to design efficient algorithms for large scale datasets, which have small time complexity or ouput good solutions as soon as possible (i.e. anytime algorithms); Parallelization must be consid-ered when dealing with large scale datasets, and the parallelization of algorithms and load balance should be solved.In this dissertation, we choose three important biological sequences analysis prob-lems, including haplotype phasing, motif search and longest common subsequence, to study the methods and technologies of large scale data analysis. The works of this dis-sertation include:(1) Haplotype phasing problem:haplotype is composed of the SNPs on a single chromosome, which is related to human diseases. The sequenced data are often geno-types, i.e. the combination of two haplotypes, therefore, the genotypes need to be phased into haplotypes. In this dissertation, we studied the population haplotype phasing prob-lem. At first, we established the flow network model and analyzed the phasing rules on this model in order to obtain some heuristic knowledge. Then, a new haplotype phasing algorithm FNphasing was presented. When applied on large scale data sets, FNphasing is significantly faster than the state-of-the-art algorithms, and also achieves an equal or superior accuracy compared with other approaches.(2) Motif search problem:Motifs are recurring and conserved regions in biologi-cal sequences, which have molecular structural or functional features, the identification of them can help us better understand the mechanisms of life. In this dissertation, we studied (I, d) motif search problem. At first, we adopted a new hashing strategy to re-duce the number of potential motifs. Then a new pruning algorithm was designed to reduce the average time complexity. When solving challenging instances, the new al-gorithm CVoting is an order of magnitude faster than current algorithms, and uses much less space.(3) Longest common subsequence problem:finding the longest common subse- quence an important method to identify the similarity of sequences, which could serve as the measurement of how two different species are evolutionarily related. In this disser-tation, we studied multiple longest common subsequence (MLCS) problem. At first, we coverted the longest common subsequence problem into a graph search problem. Then, an anytime algorithm Pro-MLCS was designed by using iterative best-first search. The experimental results show that Pro-MLCS could achieve the optimal solution with only3%of the total runnning time. Furthermore, based on Pro-MLCS, two anytime algo-rithms were designed:memory efficient algorithm-SA-MLCS-with small space in-crease rate, and space bounded algorithm, SLA-MLCS. SA-MLCS adopted an iterative beam widening strategy, and could use much less space to find the same quality solution when compared to Pro-MLCS; While SLA-MLCS adopted a replacement strategy, and could continue to find better solutions after SA-MLCS reaches the space bound. The experimental results show SA-MLCS and SLA-MLCS could deal with an order of mag-nitude larger size instances than Pro-MLCS when given the same space bound. At last, we developed two parallel algorithms:DPro-MLCS and DSDPro-MLCS:the former one is for distributed memory architecture and the latter is for hierarchical distributed-shared memory architecture. Both algorithms could achieve linear speedup, and have good anytime property.This dissertation focuses on the methods of dealing with large scale biological sequence data, thus the contributions of this dissertation for three levels are as follows:(1) Data organization:the contribution is the establishment of the global repre-sentation model. For haplotype phasing problem, this dissertation constructed the flow network model, which contains all the information that could be derived from the orig-inal data, which makes the feasible solutions and the flows one-to-one mapping, which would be useful for the algorithm design. For motif search problem, this dissertation adopted a new hashing strategy to reduce the number of stored potential motifs, which brings down the space usage. For longest common subsequence problem, this disserta-tion reorganized the solution space into a graph, and formulated the problem into finding the longest path in the graph, thus the efficient graph search algorithms could be applied.(2) Algorithm development:the contribution is the design of efficient algorithms and anytime algorithms. For haplotype phasing problem, we developed a heuristic al-gorithm FNphasing based on flow network model, which is significantly faster than current algorithms when applied on large scale datasets. For motif search problem, we designed a new pruning method to reduce the access number of hashing table, mak-ing the average time complexity of the new algorithm achieve the best among current algorithms. For longest common subsequence problem, we established an anytime al-gorithm Pro-MLCS and a space bounded anytime algorithm SLA-MLCS. Compared to current algorithms, Pro-MLCS is an order of magnitude faster when achieving the same quality solution and SLA-MLCS could handle an order of magnitude larger size instances when given the same time and space bounds.(3) Parallelization:the contribution is the parallelization of anytime algorithm. This dissertation designed a new parallel strategy to parallelize the anytime algorithm for longest common subsequence problem. The new strategy makes the parallel com-putations of different layers possible. The new parallel algorithms could achieve nearly linear speedup, and have good anytime property. It could make full use of the compu-tational resources of clusters to deal with larger scale datasets.
Keywords/Search Tags:Large scale data, SNP, haplotype, motif, longest common subsequence, flow network, graph search
PDF Full Text Request
Related items