Research Of High Performance Bit-parallel Longest Common Subsequence Algorithm

Posted on:2019-08-06

Degree:Master

Type:Thesis

Country:China

Candidate:X N Wang

Full Text:PDF

GTID:2370330545953706

Subject:Software engineering

Abstract/Summary:

PDF Full Text Request

The LCS problem is the problem of finding the longest subsequence in a data set(usually two sequences)(without changing sequence order).The longest common subsequence problem is a classical problem in many fields,such as computational biology,data Mining,text editing and other fields,as well as data comparison programs and bioinformatics applications,similar to the biological sequence similarity measurement problem.The current solution to solve the LCS problem is based on dynaMIC programming.As the biological sequence databases are growing continuously,parallel sequence comparison algorithms are becoming increasingly important.Based on the bit-parallel LCS algorithm,each character is mapped to a bits of the computer word,which reduces the scale of the matrix,saves the storage space,and operates on 32 characters,reduces the number of operations and achieves a fine granularity parallelism.Although the calculation speed is relatively higher,but with the production of large data,more and more data need to be processed,only using CPU to calculate the speed of data generation,using High-performance Computing Accelerator card acceleration is imminent.The computer cluster based on KNC and KNL accelerated card has a greater advantage,which allows us to achieve a greater degree of parallelism and achieve higher program performance.In this paper,we present XLCS,a new parallel implementation to accelerate the LCS algorithm on Xeon Phi clusters by performing bit-wise operations.We design an asynchronous data transmission and calculation framework,which can cover up most of IO communication,and in a single node,use various optimization techniques,such as quantization,OpenMP,etc.to improve the performance of the algorithm.In the cluster multi node,using MPI to carry on the multithreading communication,the cluster node is divided into the main node and the compute node,the main node is responsible for the data distribution preprocessing,the result collection processing,the computation node is responsible for the computation,the computation completes returns the result to the main node,at the same time the Through the above technical means,make full use of distributed resources,realize the optimization of the LCS algorithm on the cluster.Our performance evaluation shows that XLCS achieves a peak performance of 3.61 TCUPS on a single Xeon Phi 31S1P card and 8.20 TCUPS on a single Xeon Phi 7210 card.Compared to the wellparallelized LCS algorithm OCS implemented on three M2090 cards,XLCS achieves a speedup of 3.6�on a KNC and 8.1�on a KNL,respectively.XLCS further achieves 93.84 TCUPS running 16 compute nodes of the Tianhe-2 supercomputer and 312.30 TCUPS with 32 compute nodes of a KNL cluster.To our knowledge,this is the first reported implementation of the bit-parallel LCS algorithm on Xeon Phi clusters.XLCS is availabe at https://github.com/wxiaoning/LongestCommon-Subsequence.

Keywords/Search Tags:

Longest Common Subsequence, Bit-Parallel Algorithm, Cluster Computing, Many-core Architectures

PDF Full Text Request

Related items

1	Parallel DNA Read Mapping Algorithms On Multi-core And Many-core Architecture
2	Parallel Machine Scheduleing With Common Due Date To Maximize Early Work
3	Parallel Collaborative Algorithm For Large-Scale LBM Multiphase Flow On Heterogeneous Many-Core Platform
4	Research On BWA-MEM Core Algorithm Based On GPU Acceleration
5	High Performance Algorithms And Models For Large-scale Biological Sequence Analysis
6	Analysis And Modeling Of A Kind Of Key Problems In Biological Sequence
7	GPU Many-core Parallel Computing Of Broadband Typical Adaptive Beamforming For Acoustic Signal Processing
8	Research On Efficient Time Series Clustering Algorithms
9	Study On The Parallel Implemention And Application Of 3D Reconstruction Algorithm With GPU Cluster
10	Parallel Design And Optimization Of A Galaxy Group Finding Algorithm On SGI Vs. Distributed-memory Cluster