Font Size: a A A

Research On Pattern Matching And Frequent Pattern Mining Algorithm Based On CUDA

Posted on:2016-06-12Degree:MasterType:Thesis
Country:ChinaCandidate:S F DuFull Text:PDF
GTID:2208330461487612Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Pattern matching and finding frequent patterns over a character sequence is a hot topic in computer science and various interdisciplinary. However, as a specific character sequence in bioinformatics, gene mutation becomes the motive power of algorithms for the problem of approximate string matching; frequent patterns in biological sequence usually have important biological significance, and the existing algorithms often are ignored those. The data expansion causes the bottleneck for the performance of serial algorithms optimization, CUDA is a general purpose parallel computing platform on GPU which is so cheap but strong in parallel processing. Thus, it is theoretical and practical significance for CUDA acceleration of the problem of approximate string matching and the problem of finding frequent patterns in biological sequence.Firstly, this paper introduces CUDA, Thrust primitives and CUDA programming optimization technology. CUDA programming optimization mainly includes memory access optimization and I/O optimization. For the problem of approximate string matching with k-difference, this paper realizes CUDA optimization based on parallel algorithm DASMP, which make full use of GPU hardware acceleration. Considering the limition of the size of the memory, this paper also presents the division on large sequence under multi-GPU environment to cover the failure of dealing it. Experiments in this paper show that compared with the un-optimization parallel algorithm before, optimized parallel algorithm can achieve 20%-50% promotion; compared with the traditional serial algorithm, the optimized parallel algorithm can get 5-32 speedup, and accompanied by the increase of the amount of experimental data, the performance advantage promotion is more and more obvious.Secondly, this paper focus on CUDA acceleration of finding frequent patterns with constraints in the longer single sequence, and the realization of CUDA programming optimization to further strengthen the performance improvement. This paper first redefines a single sequence with constrained [m1, m2]-frequent pattern which means the target frequent patterns of length interval. This constraint conditions meet the special requirements of biological information sequence, to avoid the traditional algorithm from the 1 set to start the search and brought a larger consumption of space and time. This paper proposes serial algorithm POSA with a new pruning strategy flag array pruning method. Experiments show that compared with traditional serial algorithm, the POSA obtained 1.2-4.5 speedup. And then, this paper presents a parallel algorithm POPA on CUDA based on the POSA to furtherly improve the efficiency of searching, which can obtain the acceleration of 3-20.Finally, this paper studies the use of CUDA to accelerate finding tandem repeats in a long single sequence, and the Thrust library and CUDA programming optimization to further strengthen the performance improvement. This paper studies a special kind of tandem repeat LPR(Largest Pattern Repetition), a new index array based on SUA in the DNA sequence to find this kind of maximal tandem repeats LPRs. Considering the characteristics of DNA sequence character set, this paper use the binary code of the implementation on parallel serial compression on CUDA. The use of Thrust parallel primitive library better help CUDA accelerate the establishment of SUA index and the search for LPRs. Experiments show that compared with traditional serial algorithm, parallel algorithm on the overall speedup ratio can reach 1.6-5.4, and compared with parallel algorithm on cluster, it still performs better.
Keywords/Search Tags:Pattern matching, Frequent pattern, Pattern repetition, CUDA optimization
PDF Full Text Request
Related items