
A DRAM-based Processing-in-memory CNN Accelerator

Posted on: 2020-07-30    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Q Deng    Full Text: PDF
GTID: 1488306548991619    Subject: Electronic Science and Technology
Abstract/Summary:
Convolutional Neural Networks (CNNs) have made great progress in recent years and are widely used to address image classification and pattern recognition problems. The error rate of CNN-based visual recognition decreased from 28% in 2010 to 3% in 2016, surpassing the human-level error rate of 5%. However, large CNNs can have millions of parameters and require up to tens of billions of operations to process a single image frame. The proliferation of CNNs has motivated the development of novel ASIC CNN accelerators to improve both inference and training performance. Most such accelerators focus on achieving a good trade-off between improving computation efficiency and alleviating memory constraints. When exploiting mature integrated circuit designs to construct high-performance computing blocks, CNN accelerators often face tight memory bandwidth and capacity constraints. Alternatively, memory-centric designs prioritize the memory subsystem and construct processing-in-memory (PIM) architectures to overcome the memory wall, making PIM one of the most promising architectures for CNN accelerators. By integrating processing units and memory to avoid massive data movement, PIM designs strive to balance computation efficiency and memory performance. Dynamic random access memory (DRAM) stands out among candidate storage media for its high integration density, mature production technology, and rich potential computing resources. However, supporting CNNs in DRAM-based PIM designs is challenging: the mature DRAM implementation throttles its accelerating capability for CNNs, and DRAM lacks both powerful arithmetic units and DRAM-friendly high-level optimizations to unleash its full computing potential.

In this dissertation, we propose LAcc, a DRAM-based PIM accelerator that supports vector multiplication. Our contributions are as follows:

· We propose a DRAM-based carry-look-ahead adder in LAcc. The implementation exploits in-DRAM bit operations together with our enhancements. We implement a ternary weight neural network to balance inference accuracy and energy efficiency. Our experimental results show that DrAcc achieves 84.8 FPS (frames per second) at 2 W and a 2.9× power-efficiency improvement over a processing-near-memory design.

· We propose a LUT-based vector multiplication approach in LAcc. The proposed design leverages decomposed multiplication to decrease the LUT size and makes a trade-off between LUT reuse and pre-calculation. We propose a value-based encoding to further decrease LUT overhead, and an optimized XOR operation to accelerate additions in DRAM. Our experimental results show that LAcc improves multiplication performance by 6.8× over the ideal baseline and achieves a 6.3× efficiency improvement over the state of the art without accuracy loss.

· We propose flexible data partition strategies for LAcc to dynamically match resource demands. After analyzing the effect of data partitioning, we propose three modes: the high-throughput mode, the single-frame mode, and the high-power-efficiency mode. They are designed to maximize system throughput, minimize single-frame processing time, and minimize power consumption, respectively. Our experimental results show that the proposed design improves throughput by 57%, and single-image latency reaches 0.27 s.

· We propose a hybrid mapping of weights and inputs to DRAM and further study hardware utilization and page parallelism under different addends and batch sizes. Our experimental results show that the proposed design improves average throughput by 12.4% over other data mapping methods and improves the hardware utilization rate by 10%.

· We propose a LUT-based approximate vector multiplication to further accelerate the LUT-based vector multiplication, which effectively alleviates the performance bottleneck of LUT pre-calculation overhead in emerging CNNs. We implement a CAM in DRAM to support fast, parallel lookup operations. Our experimental results show that the proposed design improves the performance of MobileNet by 2.3×.

· We propose an optimization for the value '0' in DRAM. The optimization covers both zero-valued data and the '0' bits in the binary encoding. We leverage a binary-tree reordering method to group more '0's in the binary encoding and implement the reorder unit in DRAM. Our experimental results show that the proposed optimization improves the performance of AlexNet by 34%. When combined with approximate vector multiplication, the maximum performance improvement reaches 32×.
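The decomposed LUT multiplication idea above can be illustrated in software. The sketch below is not LAcc's circuit; it is a minimal Python model under assumed parameters (8-bit activations split into two 4-bit nibbles): instead of tabulating a full 8×8-bit product table (2^16 entries per weight pair), each weight is tabulated against every 4-bit nibble value (16 entries), and the full product is recombined with one shift and one add.

```python
# Illustrative model of LUT-based multiplication via operand decomposition.
# Function names, nibble width, and table layout are assumptions for
# exposition, not the actual LAcc design.

def build_lut(weight: int, nibble_bits: int = 4):
    """Pre-calculate weight * n for every possible nibble value n."""
    return [weight * n for n in range(1 << nibble_bits)]

def lut_multiply(lut, x: int, nibble_bits: int = 4) -> int:
    """Multiply the pre-tabulated weight by an 8-bit x using two lookups."""
    mask = (1 << nibble_bits) - 1
    lo = lut[x & mask]                    # weight * low nibble
    hi = lut[(x >> nibble_bits) & mask]   # weight * high nibble
    return (hi << nibble_bits) + lo       # recombine partial products
```

For example, `lut_multiply(build_lut(57), 200)` returns `11400`, i.e. 57 × 200, using only two table reads, a shift, and an add. This is the trade-off the abstract mentions: smaller tables (cheaper pre-calculation, better reuse) in exchange for extra shift-add work per product.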
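The carry-look-ahead addition mentioned in the first contribution can likewise be sketched in software. The model below is an assumption-laden illustration (a Kogge-Stone-style prefix formulation, not LAcc's in-DRAM circuit): all carries are derived from per-bit generate (`a & b`) and propagate (`a ^ b`) signals in O(log n) combining steps, rather than rippling bit by bit.

```python
# Illustrative carry-look-ahead addition using generate/propagate signals.
# A prefix (Kogge-Stone-style) formulation; parameters are assumptions.

def cla_add(a: int, b: int, width: int = 16) -> int:
    """Add two unsigned integers using carry-look-ahead logic."""
    mask = (1 << width) - 1
    g = a & b   # generate: a carry is created at this bit
    p = a ^ b   # propagate: a carry passes through this bit
    # Prefix combination: after log2(width) steps, bit i of g holds
    # "a carry comes out of position i" (assuming carry-in of 0).
    shift = 1
    while shift < width:
        g = g | (p & (g << shift))
        p = p & (p << shift)
        shift <<= 1
    carries = (g << 1) & mask       # carry into each bit position
    return ((a ^ b) ^ carries) & mask
```

For example, `cla_add(13, 7)` returns `20`. The appeal for in-DRAM logic is that generate, propagate, and the prefix steps are all bulk bitwise operations, which map naturally onto row-wide AND/OR/XOR primitives.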
Keywords/Search Tags: PIM, CNN, Accelerator, DRAM