Architecture, Mapping Algorithms and Physical Design of Mesh-of-Functional-Units FPGA Overlays for Pipelined Execution of Data Flow Graph

Posted on: 2018-01-10
Degree: Ph.D.
Type: Dissertation
University: University of Toronto (Canada)
Candidate: Capalija, Davor
Full Text: PDF
GTID: 1478390020957574
Subject: Computer Engineering
Abstract/Summary:
FPGAs can deliver high performance but their programmability wall hinders widespread use: they require hardware expertise and their CAD tools have long compile times. We tackle this challenge by exploring overlays: pre-compiled FPGA circuits that are themselves programmable via software-familiar models without FPGA CAD tools.

We propose a high-performance mesh-of-functional-units overlay architecture that projects a model of pipelined execution of data flow graphs (DFGs). It consists of cells, each containing a functional unit (FU) and routing logic, with elastic pipelines and FIFOs in every routing hop. The architecture realizes latency-insensitive data-driven execution, facilitates high Fmax and scales to large mesh sizes. We design a DFG-to-overlay mapping algorithm that places, routes, and balances DFGs on the overlay for high throughput. We also propose a bottom-up CAD flow based on partitioning and floorplanning of an overlay into tiles. The flow maintains high Fmax for large overlays and enables parallel compilation and quick stitching of tiles from a pre-compiled library.

We prototype two overlays on a Stratix IV FPGA that has 212K ALMs: a 355 MHz 24x16 integer overlay and a 312 MHz 18x16 floating-point overlay. We map 16 DFGs and show that the two overlays deliver throughput of up to 37 GOPS and 22 GFLOPS, respectively. The DFG mapping is fast, taking less than 7 seconds. The tile-based bottom-up flow achieves 37% higher Fmax than the flat flow (the default CAD flow), with only 8% more resources. Compared to the flat flow, which compiles an overlay in 4 hours, the bottom-up flow stitches together an overlay from pre-compiled tiles in 35 minutes and can compile one from scratch in one hour, if tiles are compiled in parallel.

The resource overhead of implementing the DFGs on the floating-point and integer overlays as opposed to compiling them directly to the FPGA averages to 4x and 9x, respectively. The compile time across the DFGs is reduced by 1500x.

Our work demonstrates the feasibility of designing high-performance overlays that project a software-familiar programming model, scale with increasing FPGA resources and provide a considerable reduction in compile time. These benefits come with a modest resource overhead.
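
The abstract does not say how the mapping algorithm balances a DFG. As a rough illustration only, one common notion of balancing in pipelined dataflow execution is equalizing the latency of paths that reconverge at the same node, so that operands arrive together and the pipeline is not throttled by a shallow buffer on the faster path. The sketch below is a minimal example under that assumption, not the dissertation's algorithm: the function name `balance_slack`, the node names, and the latency values are all hypothetical, and routing hops on the mesh are ignored.

```python
from collections import defaultdict


def balance_slack(edges, latency):
    """For each DFG edge (src, dst), return how many cycles earlier src's
    result arrives than the latest operand of dst. Assumption: this slack
    approximates the extra buffering needed on that edge."""
    succs, preds = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        succs[src].append(dst)
        preds[dst].append(src)

    # Kahn's algorithm: visit nodes in topological order.
    indegree = {n: len(preds[n]) for n in latency}
    ready = [n for n in latency if indegree[n] == 0]
    finish = {}  # cycle at which each node's result becomes available
    while ready:
        node = ready.pop()
        start = max((finish[p] for p in preds[node]), default=0)
        finish[node] = start + latency[node]
        for nxt in succs[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(finish) != len(latency):
        raise ValueError("graph has a cycle; a DFG must be acyclic")

    # Slack of an edge = latest operand arrival at dst - arrival over this edge.
    return {
        (src, dst): max(finish[p] for p in preds[dst]) - finish[src]
        for src, dst in edges
    }


# Hypothetical DFG for (a * b) + c with made-up FU latencies in cycles.
latency = {"a": 1, "b": 1, "c": 1, "mul": 3, "add": 1}
edges = [("a", "mul"), ("b", "mul"), ("mul", "add"), ("c", "add")]
slack = balance_slack(edges, latency)
print(slack)
```

In this toy graph the direct edge from `c` to `add` is three cycles shorter than the path through `mul`, so it is the edge that would need extra buffering. On an overlay with elastic pipelines and FIFOs in every hop, such imbalance would affect throughput rather than correctness, since execution is latency insensitive.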
Keywords/Search Tags: FPGA, Flow, Overlays, CAD, Architecture, Mapping, Execution, Compile