
Parallel Algorithms And Architectures For Matrix Computations On FPGA

Posted on: 2012-07-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: G M Wu    Full Text: PDF
GTID: 1118330341451753    Subject: Computer Science and Technology
Abstract/Summary:
Reconfigurable computing systems, such as those based on Field-Programmable Gate Arrays (FPGAs), map applications onto custom circuits. As FPGAs gain ever more hardware resources, the reconfigurable computing field has entered the era of reconfigurable supercomputing. Matrix computations are key components of scientific and engineering applications, and FPGA-based reconfigurable computing systems have become one of the best choices for accelerating them. However, several challenges must be overcome: hardware programming, parallel algorithm design, and hardware design optimization. Existing hardware architectures occupy large amounts of FPGA resources, have high storage and bandwidth requirements, and scale poorly.

This thesis focuses on the implementation of matrix computations on FPGAs. In summary, it makes the following contributions:

(1) We propose an FPGA design method for basic matrix operations and an architecture for block matrix multiplication. Space-time mapping and model construction techniques for FPGA designs of matrix computations are presented, taking matrix multiplication and matrix-vector multiplication as examples. A preliminary synthesis framework is evaluated in our experiments. Several transformations and optimizations, including loop blocking, are used to derive a block matrix multiplication algorithm with optimized data transfers and memory usage, yielding a high-performance, memory-efficient implementation of matrix multiplication. Experimental results show that our design for matrix multiplication outperforms previous work; compared to related work, the storage requirement is reduced from O(b²) to O(b), where b is the data block size.

(2) We introduce a fine-grained pipelined LU decomposition algorithm and a linear array of processing elements (PEs) that implements it. The proposed parallel algorithm fully exploits pipeline parallelism and data reuse. It can also be extended to solve a lower triangular system and a system of linear equations with multiple right-hand sides. The proposed linear array is the core component of a full hardware design implementing a solver for dense systems of linear equations; it performs column-pivoting LU decomposition and solves lower triangular systems simultaneously. An analytical performance model is built to analyze the performance potential of the proposed design. Experimental results show that our design outperforms general-purpose processors and previous work.

(3) An FPGA-based parallel algorithm and the corresponding architecture for block dense matrix decomposition are presented. Taking non-pivoting LU decomposition as an example, we propose a divide-and-conquer strategy and an FPGA implementation of blocked dense matrix decomposition. The strategy applies a series of transformations, including loop blocking and space-time mapping, to sequential non-pivoting LU decomposition. The key idea is to decompose the LU computation into fine-grained tasks that can be implemented directly on the FPGA processor array, and to sequence the execution of these tasks on that array. We also propose a high-performance, memory-efficient hardware architecture to implement our block LU decomposition. Compared to related work, which needs two sets of PEs, our design needs only one; moreover, its storage requirement is O(b), whereas the related work needs O(b²) words of storage, where b is the data block size.
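As a concrete illustration of the kernel that contributions (2) and (3) parallelize, the following C sketch performs in-place non-pivoting (Doolittle) LU decomposition of an n x n row-major matrix. The function name and data layout are illustrative assumptions rather than the thesis's interface; the thesis's actual contribution is the decomposition of this loop nest into fine-grained tasks scheduled onto a linear PE array, which a sequential sketch cannot capture.

    #include <stddef.h>

    /* In-place non-pivoting LU decomposition of an n x n row-major
     * matrix A: on return, U occupies the upper triangle and the
     * unit-lower-triangular factor L the strict lower triangle. */
    void lu_nopivot(size_t n, double *A)
    {
        for (size_t k = 0; k < n; ++k) {            /* eliminate column k */
            for (size_t i = k + 1; i < n; ++i) {
                A[i * n + k] /= A[k * n + k];       /* multiplier l_ik */
                for (size_t j = k + 1; j < n; ++j)  /* update trailing row i */
                    A[i * n + j] -= A[i * n + k] * A[k * n + j];
            }
        }
    }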
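The loop-blocking transformation central to contributions (1) and (3) can be illustrated in software as tiled matrix multiplication, sketched below under the assumptions that the matrices are n x n and row-major, n is divisible by the block size b, and C is zero-initialized; the names are illustrative. The O(b) versus O(b²) on-chip storage difference claimed above comes from how the hardware schedule streams tiles through the processor array, which this sequential sketch does not show.

    #include <stddef.h>

    /* Tiled (loop-blocked) matrix multiplication C += A * B for n x n
     * row-major matrices; n is assumed divisible by the block size b.
     * Each iteration of the outer three loops processes one b x b tile
     * pair, the unit of work mapped onto the FPGA processor array. */
    void block_matmul(size_t n, size_t b,
                      const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += b)
            for (size_t jj = 0; jj < n; jj += b)
                for (size_t kk = 0; kk < n; kk += b)
                    for (size_t i = ii; i < ii + b; ++i)
                        for (size_t k = kk; k < kk + b; ++k) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + b; ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }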
The blocking method is extended to a multi-FPGA system and applied to Cholesky decomposition. Experimental results show that our design achieves higher computing efficiency than general-purpose processors.

(4) We propose two parallel algorithms and two parallel architectures for sparse LU decomposition. The numerical decomposition phase of sparse LU decomposition is the most time-consuming part of solving sparse systems. The two algorithms, parallel right-looking (RL) LU decomposition and parallel left-looking (LL) LU decomposition, have different advantages: RL LU reduces data transfers by exploiting data reuse of the factors, while LL LU exploits more parallelism through dynamic dependence analysis. In the corresponding architectures, the data structure of the factors is generated dynamically. Experimental results show that the LL LU decomposition design outperforms the RL design and general-purpose processors.

(5) Novel parallel architectures for sparse matrix-vector multiplication (SpMV) and the Conjugate Gradient (CG) method are introduced. Most of the execution time of iterative methods for sparse systems is spent on SpMV, and the SpMV architecture can be applied to an FPGA implementation of CG. We propose a sparse matrix blocking method and a corresponding storage format suited to FPGA design. Unlike related work, SpMV designs based on this storage format can process arbitrarily large sparse matrices without any parameter changes, and one of the two proposed SpMV designs reduces zero fill efficiently. Experimental results show that the proposed SpMV designs outperform related work and general-purpose processors, and the proposed CG design also achieves higher performance than general-purpose processors.
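Since the abstract does not specify the thesis's blocked storage format, the following sketch uses the standard compressed sparse row (CSR) format as a stand-in to show the SpMV computation y = A·x that dominates each CG iteration; all names here are illustrative assumptions.

    #include <stddef.h>

    /* SpMV y = A * x over the standard CSR format, used here as a
     * stand-in for the thesis's FPGA-oriented blocked format, whose
     * details the abstract does not give. row_ptr has n_rows + 1
     * entries; col_idx and vals hold the nonzeros row by row. */
    void spmv_csr(size_t n_rows, const size_t *row_ptr,
                  const size_t *col_idx, const double *vals,
                  const double *x, double *y)
    {
        for (size_t i = 0; i < n_rows; ++i) {
            double acc = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                acc += vals[k] * x[col_idx[k]];
            y[i] = acc;
        }
    }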
Keywords/Search Tags: Field-Programmable Gate Array, reconfigurable computing, matrix computations, systems of linear equations, matrix multiplication, dense LU decomposition, sparse LU decomposition, sparse matrix-vector multiplication