Speech processing models such as speech recognition and speaker verification are widely used in intelligent conference and customer-service conversation scenarios due to their low cost and high efficiency. However, when raw, unprocessed audio is fed to these models, environmental factors such as noise degrade recognition accuracy, and continuous speech from multiple speakers further limits model performance. These adverse effects reduce the performance of speech processing models. The speaker diarization task generates correspondingly more structured information from the audio signal, so that other speech processing models can process each speaker's information within that speaker's utterance time. This thesis studies each speaker diarization module in depth and designs a speaker diarization model based on a hybrid deep neural network. The specific work is as follows:

1. To ensure that each segment produced by speech segmentation contains only one speaker's information, this thesis takes raw speech signals and spectrograms as input and designs a speech segmentation model built on a bidirectional long short-term memory (Bi-LSTM) network, which exploits the temporal information of the speech signal to establish dependencies across the time context. Speech segments are then produced according to the frame-level classification results. Compared with an energy-based voice activity detection model, the false-alarm and miss rates were reduced by 2.42% and 1.2%, respectively, and the speaker diarization error rate was reduced by 5% on average.

2. To address the low quality of speaker voiceprint feature vectors, a model based on a deep residual alternating convolutional neural network was constructed. Residual connections reuse the original information, avoid the vanishing/exploding gradient problem, and improve training. In addition, local and global attention mechanisms perform feature recalibration, which improves the quality of the speaker voiceprint vectors and reduces the diarization error rate compared with speaker diarization models based on ResNet, ResNeXt, x-vector, and other networks. A lightweight design was also incorporated during model construction, keeping the parameter count at 7.61 million; the final diarization error rate is 4.12% on the VoxConverse dataset and 7.34% on the AMI dataset.

3. To obtain an appropriate number of speaker classes by clustering without prior knowledge of the speaker count, affinity propagation clustering is introduced into the system, and a complete speaker diarization model is constructed by combining the Bi-LSTM-based speech segmentation model with the deep residual alternating convolutional neural network. Compared with hierarchical clustering or spectral clustering, the error rate is reduced by 1.1% on average; the system removes the requirement for prior knowledge of the number of classes and enables clustering with an arbitrary number of speakers.
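The key property of the clustering stage described in item 3 is that affinity propagation infers the number of speakers from pairwise similarities between segment embeddings, rather than requiring it as an input. A minimal sketch of this idea, using scikit-learn's `AffinityPropagation` on synthetic stand-in embeddings (the variable names and synthetic data are illustrative, not from the thesis):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
# Stand-in for per-segment voiceprint embeddings produced by the
# residual CNN: two synthetic "speakers", 20 segments each.
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 8)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 8)),
])

# Affinity propagation needs no preset cluster count; damping
# stabilizes the message-passing updates.
ap = AffinityPropagation(damping=0.9, random_state=0)
labels = ap.fit_predict(embeddings)

# The number of exemplars found is the inferred speaker count.
n_speakers = len(ap.cluster_centers_indices_)
print(n_speakers)  # typically 2 for these two well-separated blobs
```

In a full system, each label would be mapped back to its segment's time span to produce the "who spoke when" output; hierarchical or spectral clustering would instead need the speaker count supplied or estimated separately.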