Speaker recognition is the task of identifying people from their voices by means of speech signal processing. It has broad application prospects in network communication, consumer electronics, intelligent terminals, human-computer interaction, secure payment, and other fields. Speaker recognition has three major branches: speaker identification, speaker verification, and speaker diarization. Depending on whether the spoken text is constrained, it can also be divided into text-dependent and text-independent recognition. In recent years, advances in deep learning have brought new progress to speaker recognition based on deep neural networks. However, the large number of parameters and high computational cost of these models limit their use in embedded systems, and their recognition performance still needs to be improved. This thesis applies deep learning to text-independent speaker recognition. The main work is as follows:

(1) To reduce the high computational resource requirements of existing methods, an efficient speaker recognition model based on the TSCA-Res MBConv structure is proposed. The model greatly reduces the number of parameters by introducing Fused MBConv and MBConv structures. The TSCA module is motivated by the observation that a speaker may produce more characteristic sounds in some time segments than in others. The SE (squeeze-and-excitation) module assigns one weight per channel over the whole utterance. TSCA instead associates channel information with individual time segments, which improves the recognition performance of the model. Experimental results show that the proposed TSCA-Res MBConv structure achieves better recognition performance than the baseline method with fewer parameters.

(2) Speaker verification and speaker identification are strongly correlated, and the time pooling layer plays a particular role in the general structure of speaker recognition networks. Building on both points, a time pooling method combined with a multi-task learning scheme is proposed to improve recognition performance. In this method, a multi-task learning strategy is used so that the query vector of the self-attention mechanism in the time pooling layer carries more effective information. During training, two losses are combined to build a network model with both a speaker verification task and a speaker identification task: the triplet loss, which considers sample pairs, and the AAM-Softmax (additive angular margin softmax) loss, which considers category information. Experiments show that this scheme effectively improves performance on the speaker verification task.
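The abstract does not specify the internals of the TSCA module in (1), so the following is only a minimal sketch of the idea under our own assumptions. A C × T feature map is split into equal time segments, and each channel is averaged within each segment. An SE-style excitation, `sigmoid(W2 · ReLU(W1 · z))`, then produces one gate per (channel, segment) pair instead of one gate per channel. The function names, the weight matrices `w1`/`w2`, and the segmenting scheme are hypothetical. A real model would learn the weights and run on a tensor library.

```python
import math


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def _excite(z, w1, w2):
    # SE excitation: sigmoid(W2 . ReLU(W1 . z)), mixing information across channels
    hidden = [max(0.0, h) for h in _matvec(w1, z)]
    return [1.0 / (1.0 + math.exp(-s)) for s in _matvec(w2, hidden)]


def se_block(feat, w1, w2):
    """Standard SE on a C x T map: one gate per channel, shared by all frames."""
    z = [sum(row) / len(row) for row in feat]  # squeeze over the whole time axis
    gates = _excite(z, w1, w2)
    return [[g * x for x in row] for g, row in zip(gates, feat)]


def segment_channel_attention(feat, w1, w2, num_segments):
    """Hypothetical TSCA-like block: one gate per (channel, time segment)."""
    num_frames = len(feat[0])
    seg_len = math.ceil(num_frames / num_segments)
    out = [[0.0] * num_frames for _ in feat]
    for start in range(0, num_frames, seg_len):
        stop = min(start + seg_len, num_frames)
        # squeeze each channel over this segment only
        z = [sum(row[start:stop]) / (stop - start) for row in feat]
        gates = _excite(z, w1, w2)
        for c, row in enumerate(feat):
            for t in range(start, stop):
                out[c][t] = gates[c] * row[t]
    return out


# Usage: 2 channels, 4 frames; channel 0 is active only in the second half.
feat = [[0.0, 0.0, 4.0, 4.0],
        [1.0, 1.0, 1.0, 1.0]]
w1 = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, illustrative only
w2 = [[1.0, 0.0], [0.0, 1.0]]
se = se_block(feat, w1, w2)
tsca = segment_channel_attention(feat, w1, w2, num_segments=2)
```

With identity weights, the SE gate for channel 0 is computed from that channel's mean over all four frames. The segment-wise gate sees only the second half, where the channel is active, and therefore scales it more strongly. Channel 1 is constant, so both blocks treat it identically.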
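For the time pooling layer in (2), a common form of self-attention pooling works in three steps. It scores each frame-level vector h_t against a query vector q, normalizes the scores with a softmax over time, and takes the weighted mean. The sketch below assumes that form with a plain dot-product score. The thesis's exact scoring function, and how multi-task training shapes the query, are not shown. `attentive_pooling` is a hypothetical name.

```python
import math


def attentive_pooling(frames, query):
    """Self-attention time pooling: weight each frame by softmax(query . h_t).

    frames: list of T frame vectors, each of length D; query: length-D vector.
    Returns the utterance-level mean vector and the attention weights.
    """
    scores = [sum(q * h for q, h in zip(query, frame)) for frame in frames]
    m = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - m) for s in scores]
    total = sum(exp_scores)
    alphas = [e / total for e in exp_scores]
    dim = len(frames[0])
    # attention-weighted mean over time -> one vector of length D
    mean = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    return mean, alphas


# Usage: T = 3 frames of dimension D = 2.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
uniform_mean, uniform_w = attentive_pooling(frames, [0.0, 0.0])
focused_mean, focused_w = attentive_pooling(frames, [5.0, 0.0])
```

A zero query yields uniform weights, which reduces to plain average pooling. A query aligned with one frame concentrates almost all the weight on that frame. This is why the information carried by the query vector directly shapes the utterance-level embedding.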
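The training objective in (2) combines AAM-Softmax for the identification task with a triplet loss for the verification task. The following sketch computes both for a single sample under stated assumptions. The triplet loss uses cosine distance, and the two losses are combined as a weighted sum `L = L_AAM + lam * L_triplet` with a hypothetical weight `lam`. The hyperparameter defaults (`margin=0.2`, `scale=30.0`, triplet `margin=0.3`) are chosen only for illustration and are not taken from the thesis. All function names are hypothetical.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _cos(a, b):
    return sum(x * y for x, y in zip(_normalize(a), _normalize(b)))


def aam_softmax_loss(embedding, class_weights, label, margin=0.2, scale=30.0):
    """Additive angular margin softmax for one sample.

    The target logit is scale * cos(theta_y + margin); other logits are
    scale * cos(theta_j). The theta + margin > pi corner case is not handled.
    """
    logits = []
    for j, w in enumerate(class_weights):
        cos_t = max(-1.0, min(1.0, _cos(embedding, w)))  # clamp for acos
        if j == label:
            cos_t = math.cos(math.acos(cos_t) + margin)
        logits.append(scale * cos_t)
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]  # softmax cross-entropy


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss with cosine distance d = 1 - cos (an assumption)."""
    d_ap = 1.0 - _cos(anchor, positive)
    d_an = 1.0 - _cos(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def multitask_loss(anchor, positive, negative, class_weights, label, lam=1.0):
    """Hypothetical combination: identification (AAM) + verification (triplet)."""
    return (aam_softmax_loss(anchor, class_weights, label)
            + lam * triplet_loss(anchor, positive, negative))


# Usage: two class centres and one (anchor, positive, negative) triplet.
W = [[1.0, 0.0], [0.0, 1.0]]
anchor, pos, neg = [1.0, 0.0], [1.0, 0.1], [0.0, 1.0]
loss = multitask_loss(anchor, pos, neg, W, label=0)
```

The margin makes the identification task harder by shrinking the target logit, which pushes same-speaker embeddings closer together in angle. The triplet term works directly on pairwise distances, the quantity that a verification system thresholds.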