With the development of technology and society, biometric recognition technology has emerged. Among biometric modalities, the voice is variable and difficult to forge, its acquisition is relatively convenient and inexpensive, and because it is less privacy-invasive it is more readily accepted by the public. This paper studies speaker recognition based on multi-scale features combined with attention, and proposes two novel speaker recognition models: the Multi-scale Channel Attention Network (MSCAN) and the Multi-scale Temporal Attention Network (MSTAN). The main contributions are as follows:

(1) To address the problem that Squeeze-and-Excitation Networks (SENet) consider only the relationships between channels and ignore temporal features, this paper proposes the MSCAN model. MSCAN is mainly composed of a self-designed CA (Channel Attention) module and an MS (Multi-scale) module. The CA module consists of a one-dimensional convolution layer and a two-dimensional convolution layer, which capture the spatio-temporal relationships between channels. The MS module is formed by four dilated convolutions with different dilation rates to obtain receptive fields of various sizes. For the loss function, this paper proposes the C-AM Loss: Circle Loss is used for the early iterations, after which AM-Softmax Loss (Angular Margin Softmax Loss) is used. The results show that the Equal Error Rate (EER) is only 3.2% when MSCAN uses the C-AM Loss. Compared with SENet, MSCAN improves the F-measure by 3.5% and reduces the EER by 0.3%. Compared with other classic speaker recognition models, MSCAN still performs best, achieving EERs of 25.1% and 3.2% on the VoxCeleb1 and LibriSpeech datasets, respectively.

(2) To further address the loss of voiceprint features by MSCAN in noisy environments and Res2Net's inability to model the temporal relationships in the data, this paper proposes the MSTAN model. The TA (Temporal Attention) module replaces the 3×3 convolution layer in Res2Net with a CA layer, and each filter group receives the output features of all previous groups, forming a fully connected structure, so that the temporal information in the data is strengthened and the voiceprint features are fully captured. To then obtain receptive fields of various sizes, the MS module is applied after the TA module. Compared with Res2Net, MSTAN performs better whether AM-Softmax Loss, Circle Loss, or C-AM Loss is used, improving the F-measure by up to 16.3% and reducing the EER by up to 0.7%. Compared with other classic speaker recognition models, MSTAN still performs best, achieving EERs of 22.5% and 3.1% on the VoxCeleb1 and LibriSpeech datasets, respectively.

The two proposed speaker recognition models were evaluated in extensive experiments on the VoxCeleb1 and LibriSpeech datasets and achieved excellent results, which further demonstrates the effectiveness of MSCAN and MSTAN.
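The core idea of the MS module described above can be sketched as four parallel dilated convolutions over a feature sequence, whose outputs are combined. The dilation rates (1, 2, 3, 4), the kernel size, and the simple concatenation step below are illustrative assumptions, since the abstract states only that four dilated convolutions with different rates are used:

```python
def dilated_conv1d(x, kernel, dilation):
    """'Valid' 1-D convolution of sequence x with the given dilation rate."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # receptive field of one output value
    out = []
    for start in range(len(x) - span + 1):
        out.append(sum(kernel[j] * x[start + j * dilation] for j in range(k)))
    return out

def ms_module(x, kernel, rates=(1, 2, 3, 4)):
    """Run the four dilated branches and concatenate their outputs
    (rates are hypothetical; the thesis only specifies 'four different rates')."""
    features = []
    for r in rates:
        features.extend(dilated_conv1d(x, kernel, r))
    return features

# Each branch sees a different receptive field: with kernel size 3,
# dilation r covers 2*r + 1 input frames per output value.
x = [float(i) for i in range(16)]
avg = [1 / 3, 1 / 3, 1 / 3]                # simple averaging kernel
print(len(dilated_conv1d(x, avg, 1)))      # 14 outputs (span 3)
print(len(dilated_conv1d(x, avg, 4)))      # 8 outputs (span 9)
```

The point of the multi-branch design is that small dilation rates capture fine local detail while large rates cover long-range context, so the concatenated features describe the utterance at several temporal scales at once.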