High Performance Algorithm For Pairwise Statistical Significance Estimation

Posted on:2013-01-03

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y H Zhang

Full Text:PDF

GTID:1118330374486906

Subject:Information and Communication Engineering

Abstract/Summary:

PDF Full Text Request

The past decades have witnessed the trends of explosive growth in the quantity andvariety of publicly available biological sequence data. Dealing with this tsunami of datapouring from the sequencing factories and mining the meaningful information hiddenfrom these data, are big challenges in Bioinformatics. Another challenge coming fromthe high performance computing (HPC) platform is that there exists a lack of parallelsoftware. Because of this lack of parallel software, it is hard to harvest computationalpower of HPC platform. As a result, it is imperative to design HPC software inBioinformatics to accelerate the analysis and processing of large data sets of biologicalsequence data.Compared to the traditional database statistical significance estimation (DSSE),experimental results show that pairwise statistical significance estimation (PSSE)enables to detect related sequences (i.e., homolog) more accurately. However, due to itsinherent characteristic of computationally and data intensive, PSSE is verytime-consuming. In addition, it is demanded that the accuracy of PSSE be furtherimproved. Based on the above observations, a group of high performance algorithms forPSSE are studied deeply in this dissertation and the corresponding innovative results areas follows:1. A high performance algorithm for PSSE is proposed which is based on theNon-Uniform Memory Access (NUMA) multi-core processor. To avoid the weakness ofAmdahl model in era of multicore, an improved performance estimation mode ispresented in this dissertation, which takes the overheads (for example, threadsynchronization, data copying and memory access latency) into account. This modeltherefore helps to identify performance bottlenecks in the estimation system as a whole.Careful analysis shows that PSSE can be decomposed into three computation kernels:permutation, alignment and fitting. The complexity algorithm is optimized in terms oftime and space. The running parameters are tuned properly as well to avoid NUMAeffect. The experimental results show that the maximum speedup over sequentialimplementation is22.65Ã—when using24processor cores. 2. A new algorithm is designed for PSSE which is based on the hybrid structure ofthe distributed computational nodes with NUMA multicore processor. The algorithmemploys large-scale computational nodes in distributed memory environment to profitperformance and saves memory consumption through shared memory mechanismamong cores within one node. Furthermore, a new schedule algorithm for loadbalancing is proposed, which is based on the total length of sequences, instead of thenumber of sequences. With the optimizations described above, the hybridimplementation achieves speedups (over the corresponding sequential implementation)621.73Ã—when using1024cores across different43nodes.3. This dissertation presents an adaptively tile-based algorithm for PSSE which isbased on general-purpose graphics processor unit (GPGPU). In this algorithm, thenumber of subject sequences which are required to transfer to GPU global memory, isself-tuning according to GPU hardware configuration, resulting in a high occupancy ofGPU. In addition, the layout for global memory is reorganized to store the subjectsequence and its thousands permutated copies, which results in coalescing access. Theexperimental results show that an end-to-end speedup of369.95Ã—using dual-TeslaC2050over the CPU implementation is achieved, and its peak GCUPS is up to25.50.4. A group of strategies, including customizable substitution score matrix,alignment score-censored parameters fitting, etc., are proposed, which can be used toimprove the accuracy of PSSE significantly. The sequence specific substitution andposition-specific substitution matrix are fulfilled to replace the standard substitutionmatrices during PSSE, in which the maximum characteristics of the query sequence canbe used to obtain a more accurate score. In addition, the maximum likelihood estimationwith censoring left peak is used to fit the statistical parameters of an extreme valuedistribution (EVD), which significantly enhances the fitting accuracy of statisticalparameter. Through these optimizations of the above scoring strategies, compared to theDSSE offered by BLAST and SSEARCH, PSSE can improve the Coverage35%-40%at the same level of Error per Query.As the size of biological sequences is increasing rapidly, even more powerful highperformance computing accelerator platforms, consisting of heterogeneous componentssuch as multi-core CPU along with general-purpose GPGPUs (or GPU clusters) andpossibly FPGAs, are expected to be more and more common and imperative for sequence analysis, for which our work can serve as a meaningful stepping stone.

Keywords/Search Tags:

pairwise statistical significance estimation, biological sequence analysis, scoring strategy, high performance computing, parallel algorithm

PDF Full Text Request

Related items

1	Sequence-specific sequence comparison using pairwise statistical significance
2	Biological Sequence Alignment Problem
3	Parallel Clustering Algorithm For Large-scale Biological Data Sets
4	Research On Key Technologies Of Parallel Optimization For Biological Sequence Analysis Algorithm Based On CPU+GPU Heterogeneous System
5	Application Of Linear Space Algorithm For Pairwise Sequence Alignment Based On Parallel-Computing
6	Design And Optimization Of Parallel Algorithm For Biolgogical Sequences
7	Genetic Algorithm In Financial High Performance Computing
8	Design And Research On A Parallel Performance Data Collection,Representation And Analysis Framwork For The SMP-Cluster Architecture
9	Research On Parallel Algorithms For Biological Sequence Analysis On Multi-core CPU And GPU
10	Research Of High Performance Evolutionary Algorithm Based On Distributed Parallel Computing