With the rapid development of artificial intelligence technology, automatic speech recognition (ASR) has become an important mode of human-computer interaction and is widely used in daily life. End-to-end ASR models based on deep learning have become a research focus due to their simpler system structure and higher recognition performance. In general, end-to-end ASR models obtain performance gains from larger model capacity. The Mixture of Experts (MoE) architecture further improves speech recognition performance by increasing model capacity. Moreover, MoE selects appropriate experts for each input sample, which yields better generalization, and its sparse gating mechanism allows the number of parameters to grow without a significant increase in computational cost. Although MoE models have achieved great success in speech recognition, they still have limitations in how the number of experts is set and how experts are selected. In addition, their large size greatly limits deployment on resource-constrained devices. This thesis focuses on these problems to further study MoE-based speech recognition models. The main research content includes the following aspects:

(1) Existing MoE-based speech recognition models set the same number of experts in all MoE layers, which fails to fully exploit the different representation capabilities of the deep and shallow layers of the network. To address this problem, this thesis proposes the Pyramid MoE speech recognition model, which sets the number of experts in shallow layers to half that of deep layers, making better use of the model's representation capability. Experiments are designed on the MoE-Conformer speech recognition model, and results on the open-source LibriSpeech dataset show that Pyramid MoE achieves better recognition performance with a smaller model size than standard MoE.

(2) The MoE speech recognition model selects only one expert, which results in lower generalization; however, selecting more experts increases both computational cost and communication volume. In response to this problem, this thesis proposes the Residual MoE speech recognition model, which adds a fixed extra expert whose computation is performed separately from the MoE layer. This model improves generalization performance without significantly increasing communication volume. Experimental results verify the effectiveness of the method.

(3) The MoE model has a huge number of parameters, which greatly limits its deployment on resource-constrained embedded devices. To solve this problem, this thesis investigates compression methods for the MoE speech recognition model. Unlike existing methods that reduce the number of parameters by compressing the model into a dense structure or by parameter sharing, the method proposed in this thesis compresses the MoE speech recognition model into a binary dense speech recognition model through knowledge distillation and quantization. Two compression schemes are proposed to realize this method: two-stage compression and single-stage compression. Furthermore, quantizing weights and activations to low bit-widths reduces model size while improving computational efficiency. However, existing quantization methods for speech recognition models cannot satisfy performance and compression requirements at the same time: either the weights and activations cannot be compressed to low bit-widths, which yields no obvious acceleration of computation, or they can be compressed to low bit-widths but cause a large performance loss. To preserve performance as much as possible, a binary weight network represents the weights with 1 bit by minimizing the quantization error, and learned step size quantization represents the activations with 4 bits by learning a more suitable scaling factor. Experimental results show that the proposed method can compress the MoE-based speech recognition model by a factor of 150 with only a small performance loss, and that the proposed single-stage compression method achieves performance comparable to the two-stage method with a simpler training scheme. The proposed method thus provides a way to deploy complex MoE speech recognition models on embedded devices with limited memory and computing resources.

In summary, this thesis studies MoE-based speech recognition models and proposes two improved MoE structures, namely Pyramid MoE and Residual MoE. Both improve recognition performance, and Pyramid MoE also reduces model size and improves parameter efficiency. In addition, this thesis proposes a compression method that compresses the MoE speech recognition model into a very small binary dense model, which is of great significance for enabling huge MoE speech recognition models to overcome the low-resource limitations of embedded device deployment.
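The Pyramid MoE idea of halving the expert count in shallow layers can be sketched as a simple per-layer configuration. This is an illustrative sketch only; the function name, layer split, and counts are assumptions, not the thesis implementation.

```python
# Hypothetical sketch of the Pyramid MoE layout: the shallow half of the
# network uses half as many experts per MoE layer as the deep half.

def pyramid_expert_counts(num_layers: int, deep_experts: int) -> list:
    """Return the number of experts for each layer, shallow to deep."""
    shallow = num_layers // 2
    return ([deep_experts // 2] * shallow
            + [deep_experts] * (num_layers - shallow))

# A 6-layer model with 8 experts in deep layers gets 4 in shallow ones.
print(pyramid_expert_counts(num_layers=6, deep_experts=8))
# -> [4, 4, 4, 8, 8, 8]
```

Compared with a uniform layout, this halves the parameter count of the shallow MoE layers, which is where the reported model-size savings come from.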
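The Residual MoE forward pass can be pictured as a top-1 routed expert plus one fixed expert that always runs alongside the MoE layer. The sketch below is a toy illustration under assumed names and an assumed additive combination; it is not the thesis code.

```python
# Toy sketch of a Residual MoE layer: each input goes through its top-1
# routed expert, while a single fixed "residual" expert processes every
# input in parallel, adding capacity without extra routing traffic.

def residual_moe_forward(x, experts, gate_scores, residual_expert):
    """experts: list of callables; gate_scores: one score per expert;
    residual_expert: a fixed callable applied to every input."""
    top1 = max(range(len(experts)), key=lambda i: gate_scores[i])
    routed = experts[top1](x)        # sparse top-1 expert computation
    residual = residual_expert(x)    # fixed extra expert, always active
    return routed + residual         # assumed additive combination

experts = [lambda v: 2 * v, lambda v: 3 * v]
out = residual_moe_forward(1.0, experts, gate_scores=[0.2, 0.8],
                           residual_expert=lambda v: 0.5 * v)
print(out)  # top-1 expert gives 3.0, residual expert adds 0.5 -> 3.5
```

Because the residual expert is fixed, no additional routing decision or expert-to-expert communication is needed for it, which matches the stated goal of improving generalization without significantly increasing communication volume.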
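The two quantizers named in (3) can be sketched numerically. A binary weight network commonly uses W_b = alpha * sign(W) with alpha = mean(|W|), which minimizes the L2 quantization error; learned step size quantization rounds activations to a low-bit grid whose step size is a learnable scalar. The forms below are the standard published formulations, assumed here to match the thesis setup; names are illustrative.

```python
# Minimal numerical sketch (assumed standard forms, not the thesis code).

def binarize_weights(w):
    """1-bit weights: alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(v) for v in w) / len(w)   # optimal L2 scaling factor
    return [alpha if v >= 0 else -alpha for v in w]

def quantize_activations(x, step, bits=4):
    """Unsigned low-bit activation grid; `step` is learnable in LSQ."""
    qmax = 2 ** bits - 1                      # 15 levels for 4 bits
    out = []
    for v in x:
        q = min(max(round(v / step), 0), qmax)  # quantize and clamp
        out.append(q * step)                    # de-quantize for compute
    return out

print(binarize_weights([0.5, -1.0, 1.5]))   # alpha = 1.0 -> [1.0, -1.0, 1.0]
print(quantize_activations([0.0, 0.26, 9.9], step=0.25))  # 9.9 clamps to 3.75
```

With 1-bit weights and 4-bit activations, the stored model shrinks dramatically relative to 32-bit floats, which is the source of the roughly 150x compression ratio reported for the full pipeline.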