| In recent years, as artificial intelligence technology has developed, the application scenarios of identity authentication have become increasingly complex. Speaker recognition has attracted widespread attention and research because of its high reliability and convenience. Deep learning-based speaker recognition models currently show excellent performance. However, in some practical scenarios it is difficult to obtain large amounts of long-duration speech data, and short-duration speech severely degrades the accuracy and robustness of speaker recognition models. To solve these problems, this thesis proposes a speaker recognition method based on acoustic feature enhancement and multi-scale feature fusion, which improves both the acoustic features and the network structure. The main research contents of the thesis are as follows: (1) To address the problem that the discriminative information in a single acoustic feature is sparse in short-duration speech scenarios, an adaptive feature selection module is introduced to select a subset of features that contribute more to the speaker recognition task and to enhance the discriminability of the acoustic features. In this module, three different acoustic features are first extracted from the same speech data: the linear predictive cepstral coefficient (LPCC), the Mel filter bank (FBank), and the perceptual linear predictive cepstral coefficient (PLPCC). The three features are then concatenated, and an attention network adaptively selects the subset of features that contribute more to the speaker recognition task, removing redundant features and information irrelevant to the speaker’s identity. Experimental results show that, compared with using the single acoustic feature FBank, introducing the adaptive feature selection module reduces the equal error rate (EER) and the minimum detection cost function (min DCF) by 8.35% and 8.34%, respectively. (2) To address the
problem that speaker information is insufficient in short-duration speech scenarios, a multi-scale feature fusion network named OSA-SA-FPNet is designed to compensate for the speaker details missing from speaker embeddings through shallow feature reuse. In this network, frame-level speech features are first extracted by the one-shot aggregation (OSA) module. At the last layer of the OSA module, all preceding shallow features and the deep features are concatenated at once, which fuses the speaker details in the shallow features more efficiently. Then, a subspace attention (SA) module is inserted at the end of the OSA module to explicitly learn the complex information interactions across the channel dimension. Finally, a multi-scale aggregation (FP) module is constructed to aggregate frame-level features of different scales and enrich the speaker information in the speaker embeddings. Experimental results show that, compared with the recent model ResNet34-SE, the proposed OSA-SA-FPNet reduces the EER and the min DCF by 18.66% and 18.14%, respectively, when the input acoustic feature is FBank. In addition, the adaptive feature selection module is introduced into OSA-SA-FPNet. Compared with ResNet34-SE, OSA-SA-FPNet with the adaptive feature selection module reduces the EER by 18.23% and the min DCF by 10.49%. In short-duration speech scenarios, the speaker recognition accuracy of the proposed method consistently exceeds that of ResNet34-SE. |
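
The adaptive feature selection step described in (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the toy feature dimensions, and the fixed sigmoid gate standing in for the learned attention network are all assumptions made for illustration. It only shows the data flow: concatenate the three per-frame features, compute one weight per feature dimension, and rescale the features by those weights.

```python
import math

def concat_features(lpcc, fbank, plpcc):
    """Concatenate per-frame LPCC, FBank, and PLPCC vectors along the feature axis."""
    if not (len(lpcc) == len(fbank) == len(plpcc)):
        raise ValueError("all features must cover the same number of frames")
    return [a + b + c for a, b, c in zip(lpcc, fbank, plpcc)]

def attention_weights(frames, w, b):
    """Hypothetical gate: sigmoid(w[d] * mean over frames of x[d] + b[d]) per dimension d.

    In a trained model w and b would be learned; here they are fixed placeholders.
    """
    n_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n_frames for d in range(dim)]
    return [1.0 / (1.0 + math.exp(-(w[d] * means[d] + b[d]))) for d in range(dim)]

def select_features(frames, weights):
    """Soft selection: scale every feature dimension by its attention weight."""
    return [[x * g for x, g in zip(f, weights)] for f in frames]

# Toy input (made-up values): 2 frames with 2 LPCC, 3 FBank, and 2 PLPCC dimensions.
lpcc = [[0.1, 0.2], [0.3, 0.4]]
fbank = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5]]
plpcc = [[0.5, 0.6], [0.7, 0.8]]

frames = concat_features(lpcc, fbank, plpcc)
dim = len(frames[0])
weights = attention_weights(frames, w=[1.0] * dim, b=[0.0] * dim)
selected = select_features(frames, weights)
```

A soft gate like this keeps every dimension but down-weights the less useful ones; a hard selection would instead keep only the dimensions whose weights exceed a threshold. Which variant the thesis uses is not stated in the abstract.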
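
For readers unfamiliar with the EER metric reported above, the following sketch shows one common way to approximate it from verification scores. The function name and the toy scores are assumptions, and the thesis's evaluation code may differ (for example, by interpolating between thresholds). The EER is the operating point at which the false acceptance rate equals the false rejection rate.

```python
def compute_eer(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The returned value
    is the mean of FAR and FRR at the threshold where the two are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores (made-up): higher means "more likely the same speaker".
target = [0.9, 0.8, 0.7, 0.4]
nontarget = [0.6, 0.3, 0.2, 0.1]
eer = compute_eer(target, nontarget)  # 0.25: one target and one non-target are misclassified at threshold 0.6
```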