With the rise of emerging large-scale artificial intelligence applications, the demand for computing power in deep neural network accelerators keeps increasing. For example, the large language model behind ChatGPT uses up to 175 billion parameters, and virtual reality applications rely on many different deep neural network models to perform subtasks such as speech recognition, image segmentation, and eye tracking. Deep neural network accelerators usually adopt a systolic array architecture, and their processing can follow different dataflows such as weight stationary, input stationary, and output stationary. Because different deep neural network models, and even different layers within the same network, differ significantly, they are suited to systolic arrays of different sizes and dataflows. Using a fixed systolic array size and dataflow on a homogeneous neural network accelerator cluster can therefore under-utilize hardware computing resources and fail to reach optimal performance and power consumption. When facing a large number of deep learning tasks, heterogeneous accelerator clusters thus have significant advantages over homogeneous ones. For heterogeneous accelerator cluster architectures, scheduling tasks reasonably according to the characteristics of different convolutional neural networks becomes an important challenge, whose core is accelerator adaptation and load balancing.

In view of the above problems, this paper studies the following aspects.
(1) A heterogeneous deep learning accelerator cluster architecture composed of systolic arrays of different shapes and sizes, with backup caches to support task migration.
(2) Performance and power consumption models of systolic array accelerators under three dataflows (weight stationary, input stationary, and output stationary), verified on Cambricon MLU270-F4 accelerator card hardware; the models achieve high accuracy.
(3) A scheduling method for heterogeneous systolic array accelerator clusters, HSAS, which combines the network architecture characteristics of workload tasks with accelerator load balancing and supports task reordering, preemption, and migration to achieve priority-based scheduling.
(4) A layer-based convolutional neural network task decomposition algorithm that further decomposes and optimizes the execution of workload tasks and supports finer-grained scheduling and execution.

The proposals are cross-verified on the SCALE-Sim simulator platform and the MLU270-F4 neural network accelerator developed by Cambricon. Experiments show that under the three dataflow mapping cases of weight stationary, input stationary, and output stationary, the heterogeneous systolic array accelerator cluster proposed in this paper achieves performance improvement and power reduction with less hardware resource consumption. Compared with heterogeneous and homogeneous systolic array neural network accelerator cluster baselines, performance is improved by an average of 51%, and the energy-delay product is improved by an average of 3.51x. Compared with several common classical scheduling algorithms, the preemption-capable dynamic scheduling algorithm HSAS improves average normalized latency, system throughput, and fairness by 36%, 8%, and 55% on average.
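As a first-order illustration of the kind of dataflow-dependent performance and power model described in contribution (2), the Python sketch below estimates the cycle count and energy of a weight-stationary systolic array running one convolutional layer expressed as a GEMM. The tiling scheme, the fill/drain constants, and the energy coefficients are illustrative assumptions, not the calibrated model validated against the MLU270-F4.

```python
import math
from dataclasses import dataclass

@dataclass
class ConvLayer:
    """A convolutional layer expressed as an (M x K) x (K x N) GEMM."""
    M: int  # output channels (filters)
    K: int  # reduction dimension (in_channels * kernel_h * kernel_w)
    N: int  # output activations (out_h * out_w)

def ws_cycles(layer: ConvLayer, rows: int, cols: int) -> int:
    """First-order cycle estimate for a weight-stationary rows x cols array.

    The K dimension is mapped along the rows and the M dimension along the
    columns; each weight tile is preloaded once and N input vectors are
    streamed through it (fill + stream + drain per tile).
    """
    folds = math.ceil(layer.K / rows) * math.ceil(layer.M / cols)
    fill = rows                  # cycles to preload one weight tile
    stream = layer.N             # one input column per cycle while the tile is resident
    drain = rows + cols - 2      # pipeline drain of the last partial sums
    return folds * (fill + stream + drain)

def energy_estimate(layer: ConvLayer, cycles: int,
                    e_mac: float = 1.0, e_sram: float = 0.5,
                    p_leak: float = 0.1) -> float:
    """Toy energy model: MAC energy + on-chip access energy + leakage."""
    macs = layer.M * layer.K * layer.N
    sram_accesses = 3 * macs     # weight, input, and partial-sum access per MAC (upper bound)
    return macs * e_mac + sram_accesses * e_sram + cycles * p_leak

# Compare a few hypothetical array shapes for one layer.
layer = ConvLayer(M=64, K=3 * 3 * 64, N=56 * 56)
for rows, cols in [(8, 8), (16, 16), (32, 8)]:
    c = ws_cycles(layer, rows, cols)
    print(f"{rows}x{cols}: {c} cycles, energy ~ {energy_estimate(layer, c):.2e}")
```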
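The next sketch illustrates only the dispatch side of a priority-based scheduler in the spirit of HSAS (contribution (3)): tasks are served in priority order, and each one is placed on the heterogeneous accelerator with the earliest estimated finish time, which combines accelerator adaptation with load balancing. The runtime estimator, the `Accelerator` and `Task` fields, and the omission of preemption and migration through the backup cache are all simplifying assumptions for illustration.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Accelerator:
    name: str
    rows: int
    cols: int
    dataflow: str            # 'ws', 'is', or 'os'
    busy_until: float = 0.0  # time at which the accelerator becomes free

@dataclass(order=True)
class Task:
    priority: int                          # lower value = higher priority
    name: str = field(compare=False, default="")
    macs: int = field(compare=False, default=0)

def estimated_runtime(task: Task, acc: Accelerator) -> float:
    # Placeholder estimator: ideal MACs per cycle on a rows x cols array.
    # A real scheduler would call a dataflow-specific model here.
    return task.macs / (acc.rows * acc.cols)

def dispatch(tasks, accelerators):
    """Priority-ordered, earliest-finish-time dispatch (non-preemptive sketch).

    Returns a trace of (task, accelerator, start, finish) tuples.  Task
    reordering is captured by the priority queue; preemption and migration
    are not modeled here.
    """
    ready = list(tasks)
    heapq.heapify(ready)                   # highest-priority task first
    trace = []
    while ready:
        task = heapq.heappop(ready)
        best = min(accelerators,
                   key=lambda a: a.busy_until + estimated_runtime(task, a))
        start = best.busy_until
        finish = start + estimated_runtime(task, best)
        best.busy_until = finish
        trace.append((task.name, best.name, start, finish))
    return trace

accs = [Accelerator("A0", 32, 32, "ws"), Accelerator("A1", 8, 64, "os")]
jobs = [Task(2, "resnet-block", 90_000_000), Task(1, "eye-tracking", 5_000_000)]
for row in dispatch(jobs, accs):
    print(row)
```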
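Finally, a minimal sketch of layer-based task decomposition as in contribution (4): each layer of a network becomes an independently schedulable subtask, and layers with many filters are further tiled along the filter dimension so the pieces fit smaller systolic arrays. The `max_filters` threshold and the dependency encoding are illustrative assumptions rather than the paper's algorithm.

```python
import math

def decompose(network, max_filters=64):
    """Split a network (list of {"name", "M", "K", "N"} layer dicts) into
    per-layer subtasks, tiling large layers along the filter (M) dimension.
    Tiles of the same layer are independent; each layer depends on the
    previous one (producer-consumer), encoded by the layer index."""
    subtasks = []
    for i, layer in enumerate(network):
        tiles = math.ceil(layer["M"] / max_filters)
        for t in range(tiles):
            m = min(max_filters, layer["M"] - t * max_filters)
            subtasks.append({
                "name": f'{layer["name"]}/tile{t}',
                "M": m, "K": layer["K"], "N": layer["N"],
                "depends_on": i - 1 if i > 0 else None,   # index of the producing layer
            })
    return subtasks

net = [
    {"name": "conv1", "M": 64,  "K": 3 * 7 * 7,  "N": 112 * 112},
    {"name": "conv2", "M": 128, "K": 64 * 3 * 3, "N": 56 * 56},
]
for st in decompose(net):
    print(st)
```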