Speaker diarization is a research direction in speech signal processing with a wide range of applications, and it is one of the current research hotspots in the field. In recent years, with the rise of deep learning in various fields, speaker segmentation and clustering technology has developed rapidly. Compared with traditional methods, deep learning-based methods are more accurate, but they also have disadvantages such as large data requirements and high computational cost, so there is still much room for improvement. This paper addresses these problems by studying three modules: effective speech detection (i.e., voice activity detection), speaker segmentation, and speaker clustering. The main contents are as follows: (1) To reduce the number of parameters, enhance feature reuse, and avoid overfitting, the network for effective speech detection is designed based on DenseNet, with modifications that reduce the complexity of the network and improve the computational speed of the model. Because Log-Mel or MFCC features alone cannot characterize multiple types of voices well, two combined features are obtained by concatenating three other acoustic features with the Log-Mel features and with the MFCC features, which improves detection accuracy compared with using Log-Mel or MFCC alone. To further improve the accuracy of effective speech detection, this paper also uses Dempster-Shafer (DS) evidence theory to fuse the softmax-layer outputs that the network produces for the two combined features. Experiments show that the DS-fused predictions are more accurate than those obtained with either combined feature alone. (2) In speaker segmentation, we slice the speech data into equal-length segments with a 50% overlap rate so that each segment can be treated as containing a single speaker, then use an embedding extraction network to extract speaker embedding features
from the segments, and then merge adjacent segments that belong to the same speaker, which achieves speaker change point detection and speaker segmentation. To alleviate the vanishing gradient problem caused by deepening the network, this paper designs a speaker embedding extraction network based on ResNet. Because such a network ignores the relevance of global speech frames, an attention mechanism is added to the network and the cross-entropy loss function is modified so that different feature maps are assigned different weights, which makes the extracted embedding features more discriminative between different speakers. Under the same experimental conditions, the speaker embedding features extracted by the ResNet with the added attention mechanism discriminate better between the identities of different speakers. (3) In speaker clustering, traditional clustering algorithms give unsatisfactory results because they are sensitive to parameter selection, threshold setting, the distribution of data points, and large gaps between cluster centers. To address these problems, an improved algorithm based on spectral clustering is proposed for speaker clustering. The main idea is to reconstruct the Laplacian matrix from the similarity matrix and to refine this matrix with a series of operations borrowed from image processing, so that the boundaries between different speakers become clearer. The improved spectral clustering algorithm can estimate the number of speakers from the gaps between eigenvalues (the eigengap) and achieves better clustering quality for data with arbitrary spatial distribution. The improved spectral clustering is compared experimentally with traditional clustering algorithms, and the results show that the improved spectral
clustering achieves the best clustering results among the compared algorithms. By improving the effective speech detection, speaker segmentation, and speaker clustering methods, this paper improves the performance of each module over the traditional methods and thereby reduces the segmentation and clustering error rate of the speaker diarization system.
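
As an illustration of the DS fusion step in (1), the following is a minimal sketch of Dempster's rule of combination applied to two softmax outputs. It assumes that each softmax vector is treated as a mass function over singleton hypotheses only (for example, speech and non-speech), in which case the rule reduces to a normalized element-wise product. The function name `ds_fuse` and the example values are hypothetical and are not taken from the experiments in this paper.

```python
import numpy as np


def ds_fuse(m1, m2):
    """Combine two mass functions over the same singleton classes
    with Dempster's rule (assumption: no mass on compound sets)."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    if m1.shape != m2.shape:
        raise ValueError("both mass vectors must cover the same classes")
    # Mass on which both sources agree, per class.
    agreement = m1 * m2
    # The sum of agreeing mass equals 1 - K, where K is the conflict.
    total = agreement.sum()
    if total == 0.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return agreement / total


# Hypothetical softmax outputs [speech, non-speech] for the two combined features.
fused = ds_fuse([0.7, 0.3], [0.6, 0.4])
print(fused)
```

When both outputs favor the same class, as in this example, the fused belief in that class is higher than in either input, which is the intuition behind fusing the two predictions.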
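
The segmentation procedure in (2) can be sketched as two small steps: cutting the signal into equal-length windows with 50% overlap, and merging adjacent windows that carry the same speaker label. This is only a sketch under stated assumptions: the helper names `sliding_windows` and `merge_adjacent` are hypothetical, window boundaries are given in samples, and the per-window speaker labels are taken as given (how they are derived from the embedding features is outside this example).

```python
def sliding_windows(num_samples, win, overlap=0.5):
    """Return (start, end) sample indices of equal-length windows."""
    hop = max(1, int(win * (1.0 - overlap)))  # 50% overlap -> hop = win / 2
    return [(s, s + win) for s in range(0, num_samples - win + 1, hop)]


def merge_adjacent(segments, labels):
    """Merge consecutive segments that share the same speaker label."""
    merged = []
    for (start, end), label in zip(segments, labels):
        if merged and merged[-1][2] == label:
            # Same speaker as the previous segment: extend its end point.
            merged[-1] = (merged[-1][0], end, label)
        else:
            # Speaker changed (or first segment): open a new segment.
            merged.append((start, end, label))
    return merged


windows = sliding_windows(num_samples=8, win=4)
# Hypothetical labels, one per window.
turns = merge_adjacent(windows, ["A", "A", "B"])
print(windows)
print(turns)
```

The boundaries between merged segments in `turns` correspond to the detected speaker change points; because the windows overlap, neighbouring merged segments can overlap as well.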
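
To make the speaker-count estimation in (3) concrete, the sketch below applies the standard eigengap heuristic to a symmetric normalized Laplacian built from a similarity matrix. It does not reproduce the refinement operations of the improved algorithm; the function name `estimate_num_speakers`, the `max_speakers` parameter, and the toy similarity matrix are assumptions made for illustration.

```python
import numpy as np


def estimate_num_speakers(similarity, max_speakers=10):
    """Estimate the number of clusters from the largest eigengap
    of the symmetric normalized Laplacian (standard heuristic)."""
    a = np.asarray(similarity, dtype=float)
    a = (a + a.T) / 2.0                      # enforce symmetry
    n = a.shape[0]
    degree = a.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # L = I - D^{-1/2} A D^{-1/2}
    laplacian = np.eye(n) - d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(laplacian)  # sorted in ascending order
    k_max = min(max_speakers, n - 1)
    # gaps[i] is the gap between the (i+1)-th and (i+2)-th smallest eigenvalues.
    gaps = np.diff(eigvals[: k_max + 1])
    return int(np.argmax(gaps)) + 1


# Toy similarity matrix: two groups of three segments, similar only within a group.
block = np.ones((3, 3))
toy = np.block([[block, np.zeros((3, 3))], [np.zeros((3, 3)), block]])
num_speakers = estimate_num_speakers(toy)
print(num_speakers)
```

For well-separated speakers, the Laplacian has as many near-zero eigenvalues as there are speakers, so the largest gap appears right after them; noisier similarity matrices make this gap less pronounced, which is one motivation for refining the matrix first.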