With the rapid development of artificial intelligence technology, intelligent speech recognition has achieved remarkable results across many languages. In daily life, however, accents and dialects are unavoidable owing to factors such as geography and education level. Data sets for these accents and dialects are highly diverse and very scarce, which makes direct model training and recognition difficult. Solving this problem has become a common challenge for speech technology companies worldwide. This article focuses on dialect speech recognition based on the Transformer model, with the attention mechanism at its core. The following work has been carried out:

(1) A weighted attention mechanism is introduced into the Transformer model for accented speech recognition. Because speech habits, i.e., accents, differ across regions, the accuracy of accented speech recognition systems is relatively low. In this paper, a weighted attention mechanism replaces the traditional self-attention mechanism in the Transformer, allowing the model to learn feature information in different subspaces simultaneously, better capture accent features, distinguish different types of accents, and improve the accuracy of accented speech recognition (a schematic sketch is given after this abstract).

(2) The working mode of the multi-head attention mechanism in speech recognition tasks is analyzed. Attention mechanisms have made great progress in a variety of speech-related tasks, yet their working principles are rarely examined in depth, and understanding how the attention mechanism works is crucial for improving and applying it. This paper studies the function and importance of individual attention heads in speech recognition through methods such as trainable head weights, pruning, classification, and visualization (see the pruning sketch below).

(3) To adapt to the characteristics of speech features, a variant of the traditional self-attention mechanism, the "diagonal and vertical self-attention mechanism," is proposed. The attention-based Transformer was originally proposed for machine translation. Unlike text features, frames of speech features may contain repeated or silent information, so the traditional attention mechanism may not be entirely suitable for speech recognition. To address this problem, this paper proposes a novel sparse attention mechanism, the "diagonal and vertical self-attention mechanism," by studying how attention works in speech recognition and redesigning its computation (an illustrative mask is sketched below). This method greatly alleviates the quadratic computational complexity of traditional attention, reduces memory consumption, and significantly improves model performance. At the same time, it allows the number of encoder layers to be reduced by one third without affecting performance, substantially lowering the model's complexity and its training and inference time.
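The sketch below illustrates one plausible form of the weighted attention described in contribution (1): each head's output is scaled by a trainable scalar weight before the heads are concatenated, so the model can emphasize the subspaces that carry accent cues. The head count, the placement of the weights, and the weighting scheme are assumptions for illustration, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedMultiHeadAttention(nn.Module):
    """Multi-head self-attention with one trainable weight per head (illustrative)."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # one trainable scalar per head (hypothetical placement of the weights)
        self.head_weights = nn.Parameter(torch.ones(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (batch, heads, time, d_head)
        q, k, v = (z.view(b, t, self.num_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = F.softmax(scores, dim=-1)
        heads = attn @ v                                   # (b, h, t, d_head)
        # scale each head's contribution by its learned weight before concatenation
        heads = heads * self.head_weights.view(1, -1, 1, 1)
        return self.out(heads.transpose(1, 2).reshape(b, t, -1))
```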
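For the head analysis in contribution (2), one hypothetical way to use the learned weights is to rank heads by weight magnitude and prune the least important ones, then re-measure recognition accuracy. The sketch below reuses the WeightedMultiHeadAttention class above; the ranking criterion and pruning-by-zeroing strategy are assumptions, not the thesis's procedure.

```python
import torch

@torch.no_grad()
def prune_low_weight_heads(attn: "WeightedMultiHeadAttention", keep: int) -> list[int]:
    """Keep the `keep` heads with the largest learned weights; zero out the rest."""
    order = attn.head_weights.abs().argsort(descending=True)
    pruned = order[keep:].tolist()
    attn.head_weights[pruned] = 0.0
    return pruned  # indices of pruned heads, e.g. for visualization or analysis
```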
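Contribution (3) names a "diagonal and vertical" sparse attention pattern. A rough sketch follows, under the assumption that "diagonal" means a local band around each frame and "vertical" means a small set of anchor frames that every query may attend to; the band width and the choice of vertical positions here are illustrative, not the thesis's values. Restricting attention to this pattern is what reduces the cost from quadratic in the sequence length toward linear.

```python
import torch

def diagonal_vertical_mask(seq_len: int, band: int = 8, stride: int = 32) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True means attention is allowed."""
    idx = torch.arange(seq_len)
    # diagonal band: each frame attends to its temporal neighbourhood
    diag = (idx[:, None] - idx[None, :]).abs() <= band
    # vertical stripes: every query may also attend to a sparse set of anchor frames
    vert = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    vert[:, ::stride] = True
    return diag | vert

# Usage: block disallowed positions before the softmax, e.g.
#   mask = diagonal_vertical_mask(t)
#   scores = scores.masked_fill(~mask, float("-inf"))
```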