Font Size: a A A

High Performance Algorithm For Pairwise Statistical Significance Estimation

Posted on:2013-01-03Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y H ZhangFull Text:PDF
GTID:1118330374486906Subject:Information and Communication Engineering
Abstract/Summary:PDF Full Text Request
The past decades have witnessed the trends of explosive growth in the quantity andvariety of publicly available biological sequence data. Dealing with this tsunami of datapouring from the sequencing factories and mining the meaningful information hiddenfrom these data, are big challenges in Bioinformatics. Another challenge coming fromthe high performance computing (HPC) platform is that there exists a lack of parallelsoftware. Because of this lack of parallel software, it is hard to harvest computationalpower of HPC platform. As a result, it is imperative to design HPC software inBioinformatics to accelerate the analysis and processing of large data sets of biologicalsequence data.Compared to the traditional database statistical significance estimation (DSSE),experimental results show that pairwise statistical significance estimation (PSSE)enables to detect related sequences (i.e., homolog) more accurately. However, due to itsinherent characteristic of computationally and data intensive, PSSE is verytime-consuming. In addition, it is demanded that the accuracy of PSSE be furtherimproved. Based on the above observations, a group of high performance algorithms forPSSE are studied deeply in this dissertation and the corresponding innovative results areas follows:1. A high performance algorithm for PSSE is proposed which is based on theNon-Uniform Memory Access (NUMA) multi-core processor. To avoid the weakness ofAmdahl model in era of multicore, an improved performance estimation mode ispresented in this dissertation, which takes the overheads (for example, threadsynchronization, data copying and memory access latency) into account. This modeltherefore helps to identify performance bottlenecks in the estimation system as a whole.Careful analysis shows that PSSE can be decomposed into three computation kernels:permutation, alignment and fitting. The complexity algorithm is optimized in terms oftime and space. The running parameters are tuned properly as well to avoid NUMAeffect. The experimental results show that the maximum speedup over sequentialimplementation is22.65×when using24processor cores. 2. A new algorithm is designed for PSSE which is based on the hybrid structure ofthe distributed computational nodes with NUMA multicore processor. The algorithmemploys large-scale computational nodes in distributed memory environment to profitperformance and saves memory consumption through shared memory mechanismamong cores within one node. Furthermore, a new schedule algorithm for loadbalancing is proposed, which is based on the total length of sequences, instead of thenumber of sequences. With the optimizations described above, the hybridimplementation achieves speedups (over the corresponding sequential implementation)621.73×when using1024cores across different43nodes.3. This dissertation presents an adaptively tile-based algorithm for PSSE which isbased on general-purpose graphics processor unit (GPGPU). In this algorithm, thenumber of subject sequences which are required to transfer to GPU global memory, isself-tuning according to GPU hardware configuration, resulting in a high occupancy ofGPU. In addition, the layout for global memory is reorganized to store the subjectsequence and its thousands permutated copies, which results in coalescing access. Theexperimental results show that an end-to-end speedup of369.95×using dual-TeslaC2050over the CPU implementation is achieved, and its peak GCUPS is up to25.50.4. A group of strategies, including customizable substitution score matrix,alignment score-censored parameters fitting, etc., are proposed, which can be used toimprove the accuracy of PSSE significantly. The sequence specific substitution andposition-specific substitution matrix are fulfilled to replace the standard substitutionmatrices during PSSE, in which the maximum characteristics of the query sequence canbe used to obtain a more accurate score. In addition, the maximum likelihood estimationwith censoring left peak is used to fit the statistical parameters of an extreme valuedistribution (EVD), which significantly enhances the fitting accuracy of statisticalparameter. Through these optimizations of the above scoring strategies, compared to theDSSE offered by BLAST and SSEARCH, PSSE can improve the Coverage35%-40%at the same level of Error per Query.As the size of biological sequences is increasing rapidly, even more powerful highperformance computing accelerator platforms, consisting of heterogeneous componentssuch as multi-core CPU along with general-purpose GPGPUs (or GPU clusters) andpossibly FPGAs, are expected to be more and more common and imperative for sequence analysis, for which our work can serve as a meaningful stepping stone.
Keywords/Search Tags:pairwise statistical significance estimation, biological sequence analysis, scoring strategy, high performance computing, parallel algorithm
PDF Full Text Request
Related items