
The Study Of Many-core Deep Learning Accelerator Based On BWDSP

Posted on: 2019-11-15    Degree: Master    Type: Thesis
Country: China    Candidate: W Q Deng    Full Text: PDF
GTID: 2428330545977033    Subject: Computer system architecture
Abstract/Summary:
The rapid growth of hardware computational capacity and of the amount of available data are regarded as two key factors behind the recent successes of artificial intelligence in computer vision, speech recognition, natural language processing, and other fields. However, as transistor sizes approach their physical limits, designing highly efficient processors that can handle explosively growing volumes of data has become a major challenge. Deep learning accelerators, domain-specific hardware designed around the characteristics of deep learning workloads, are used to accelerate deep learning computation. Such hardware has shown excellent energy efficiency, and many of these designs achieve higher throughput than GPUs. Many-core architectures are widely adopted in computing infrastructure and accelerators for deep learning.

In this thesis, a single BWDSP100 core is taken as the basic computing prototype core, and an algorithm-hardware co-design method is used to study a many-core deep learning accelerator. The study is divided into three phases. First, based on the BWDSP100 core, the single-core prototype is used to examine how to implement and optimize convolutional neural networks on one core. Second, many-core computing algorithms for convolutional networks are studied; their performance should scale linearly with the number of cores, and they should also facilitate the design of the on-chip interconnect. Finally, an on-chip interconnect suited to the many-core computing algorithms is designed.

The main contributions of this thesis are as follows:

1) By exploiting the characteristics of convolutional layers and the multi-cluster architecture, this thesis proposes a coarse-grained parallel method for computing convolutional layers. The proposed algorithm is 5.7x faster than a conventional vectorization algorithm and 9.5x faster than the GEMM-based algorithm commonly used on GPUs. Its performance density on BWDSP100 is 1.55 times that of the tiling-based algorithm widely adopted on systems with cache hierarchies, measured on the 66AK2H12 platform.

2) An automatic code generation tool is presented to optimize convolutional layers with specific arguments. The code generated by the tool is 2.33x to 4.12x faster than the general convolution function, and the performance of the convolutional layers optimized by the tool approaches the theoretical peak of the BWDSP100.

3) Based on the BSP abstract parallel model, many-core computing algorithms for all types of layers in a CNN are presented. Results on VGG-16C show that the performance of the algorithms scales linearly with the number of cores. Meanwhile, by reusing data and eliminating redundant transfers to reduce the hardware requirements, only 6 GB/s of bandwidth is needed for inter-core data transfer when running the algorithms on 56 cores, so the algorithms facilitate the design of a simpler and more efficient on-chip interconnect.

4) Layer fusion is an important method for optimizing deep learning computation: it eliminates data transfers between adjacent layers and accelerates their computation by fusing them into a single layer. In this thesis, the technique is extended so that it can be applied to convolutional and pooling layers with specific parameters, not just to element-wise layers (a sketch of this idea is given after the list of contributions).

5) By analyzing the characteristics of the proposed many-core CNN algorithms, a suitable on-chip interconnect for the cores is proposed. Instead of the 2D mesh network-on-chip widely used in many-core processors for inter-core communication, the thesis adopts a bus, which is simpler and more efficient. A single DMA controller manages the data transfers of all cores, rather than every core issuing transfers independently, which avoids bus conflicts and reduces the number of DMA controllers. As a result, in the 56-core configuration the proposed interconnect saves 12.88x area compared with a conventional bus and 15.08x compared with a 2D mesh, and saves 2.42x energy compared with the bus and 3.77x compared with the 2D mesh.

6) Multiple buffers allow computation and data transfer to proceed in parallel, improving the utilization of both computing and communication resources. Compared with the double buffering widely used in many systems, the rotating triple buffers proposed in this thesis save 1/4 of the on-chip buffer memory (a minimal sketch is given at the end of this abstract).
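To illustrate the extended layer fusion of contribution 4, the following is a minimal C sketch of a convolutional layer fused with a 2x2 max-pooling layer, so that the intermediate convolution output map is never written back to memory. It is only an illustration of the general idea under simplifying assumptions (single image, stride 1, no padding, 2x2 pooling with stride 2); the function and parameter names such as fused_conv_maxpool are invented for this sketch and are not taken from the thesis or from the BWDSP toolchain.

```c
/* Illustrative sketch (not the thesis implementation): a convolutional layer
 * fused with a 2x2 max-pooling layer.  Each pooled output is produced directly
 * from the four convolution results it covers, so the full convolution output
 * map is never stored.  Layouts: input C x H x W, weights K x C x R x S,
 * output K x (convH/2) x (convW/2). */
#include <float.h>

static float conv_at(const float *in, const float *w,
                     int C, int H, int W, int R, int S,
                     int k, int y, int x)
{
    /* One convolution output element for output channel k at position (y, x). */
    float acc = 0.0f;
    for (int c = 0; c < C; ++c)
        for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s)
                acc += in[(c * H + y + r) * W + x + s] *
                       w[((k * C + c) * R + r) * S + s];
    return acc;
}

void fused_conv_maxpool(const float *in, const float *w, float *out,
                        int C, int H, int W, int K, int R, int S)
{
    int convH = H - R + 1, convW = W - S + 1;   /* convolution output size   */
    int poolH = convH / 2,  poolW = convW / 2;  /* 2x2, stride-2 pooled size */

    for (int k = 0; k < K; ++k)
        for (int py = 0; py < poolH; ++py)
            for (int px = 0; px < poolW; ++px) {
                float m = -FLT_MAX;
                /* Compute the 2x2 window of convolution outputs on the fly
                 * and keep only their maximum. */
                for (int dy = 0; dy < 2; ++dy)
                    for (int dx = 0; dx < 2; ++dx) {
                        float v = conv_at(in, w, C, H, W, R, S,
                                          k, 2 * py + dy, 2 * px + dx);
                        if (v > m) m = v;
                    }
                out[(k * poolH + py) * poolW + px] = m;
            }
}
```

On a real accelerator the same structure would of course be tiled and vectorized per core; the point of the sketch is only that each pooled output can be produced directly from the convolutions it covers, removing the intermediate feature-map traffic between the two layers.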
According to the experimental results, with 56 cores and 896 multipliers and single-precision floating-point data, the performance of the proposed many-core deep learning accelerator on VGG-16C is 719.12 GFLOPS. Keeping the on-chip interconnect bandwidth unchanged but splitting all 32-bit floating-point multipliers into 8-bit multipliers, the performance would reach 2.88 TOPS. Since the many-core accelerator has fewer multipliers than the FPGA accelerator DLA, the ASIC accelerator TPU, and the NVIDIA K40 GPU, its absolute performance is lower than theirs. However, when comparing performance per equivalent 8-bit multiplier, the many-core accelerator is 4.21x better than the TPU, 2.67x better than DLA, and on par with the K40, indicating that the proposal is more resource efficient.
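As a rough illustration of the rotating triple buffers of contribution 6, the following C sketch rotates three on-chip buffers through the roles "being loaded", "being computed on", and "being stored", overlapping DMA transfers with computation using three buffers instead of the four needed when input and output are each double buffered (hence the 1/4 memory saving). The routines dma_load, dma_store, dma_wait, and compute_tile are placeholders standing in for the platform's DMA and kernel functions; they are not real BWDSP APIs.

```c
/* Illustrative sketch (not the thesis implementation): rotating triple buffer.
 * At steady state one buffer receives tile t+1 via DMA, one buffer holds tile t
 * being computed in place, and one buffer drains the result of tile t-1. */
#include <stddef.h>

extern void dma_load (float *dst, size_t tile);       /* start DMA: DDR -> buffer   */
extern void dma_store(const float *src, size_t tile); /* start DMA: buffer -> DDR   */
extern void dma_wait (void);                          /* wait for outstanding DMAs  */
extern void compute_tile(float *buf);                 /* in-place tile computation  */

void process_tiles(float *buf[3], size_t num_tiles)
{
    if (num_tiles == 0) return;

    int load = 0, comp = 1, store = 2;      /* which buffer plays which role */

    dma_load(buf[load], 0);                  /* prefetch tile 0 */

    for (size_t t = 0; t < num_tiles; ++t) {
        dma_wait();                          /* tile t has landed; previous store is done */

        /* Rotate roles: the freshly loaded buffer becomes the compute buffer,
         * the old compute buffer will be stored, and the drained buffer is
         * reused for the next prefetch. */
        int next_load = store;
        store = comp;
        comp  = load;
        load  = next_load;

        if (t + 1 < num_tiles)
            dma_load(buf[load], t + 1);      /* start loading the next tile      */
        if (t > 0)
            dma_store(buf[store], t - 1);    /* start draining tile t-1's result */

        compute_tile(buf[comp]);             /* overlaps with both transfers     */
    }

    dma_wait();
    dma_store(buf[comp], num_tiles - 1);     /* drain the final result */
    dma_wait();
}
```

The role rotation at the top of each iteration is the only bookkeeping required; with separate double buffering for the input and output streams the same overlap would need four buffers, which is where the 1/4 on-chip memory saving comes from.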
Keywords/Search Tags: Deep Learning, Convolutional Neural Network, Hardware Accelerator, Many-core Computing, On-Chip Interconnection, Algorithm-Hardware Co-design