
Research On Speaker Diarization Based On Deep Learning

Posted on: 2021-06-28
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Yuan
Full Text: PDF
GTID: 2428330647457265
Subject: Cyberspace security
Abstract/Summary:
Speaker diarization is one of the key technologies in speech signal processing and an important component of many speech application systems, playing a significant role in the processing of multi-speaker audio. In recent years, with the wide application of Deep Neural Networks (DNNs) in the speech field, speaker diarization has developed rapidly. However, current research on speaker diarization is not yet mature, and its performance in industrial applications still needs improvement. In the DIHARD challenge, systems have consistently shown high diarization error rates and poor robustness owing to noise, unbalanced speech durations, and overlapping speech. Against this background, this thesis analyzes the problems of speaker diarization systems in the setting of the DIHARD challenge and studies the network structure, the loss function, and the speaker clustering algorithm in depth. The main results are as follows.

To address the fact that the statistics pooling in the speaker feature extraction network ignores the differences between individual speech frames, this thesis proposes a speaker feature extraction method based on a dual self-attention mechanism. First, a multi-head self-attention mechanism is introduced into the Time-Delay Neural Network (TDNN) based speaker feature extraction network: a multilayer perceptron learns a weight for each speech frame, and the weighted mean and variance statistics are then computed to obtain a discriminative speaker embedding. Second, to make better use of the different levels of speaker information captured by the attention heads, a self-attention mechanism is applied over the attention heads themselves to further enhance the discriminability of the speaker embedding. Experimental results show that the dual self-attention mechanism strengthens the network's extraction of discriminative speaker information and enhances the representational power of the x-vector: when the dimensionality of Linear Discriminant Analysis (LDA) is 256, system performance improves by 1.99% on average over the baseline. The experiments also examine how the LDA dimensionality and the length normalization of x-vectors during LDA affect system performance. The results show that raising the LDA dimensionality to 512 further improves both the baseline system and the dual self-attention system, whereas length normalization of the x-vectors only improves the baseline system and brings no significant gain to the dual self-attention system.
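As a rough illustration of the first stage of this pooling (the per-frame multi-head attention; the second self-attention applied over the heads is omitted), the following PyTorch sketch shows multi-head attentive statistics pooling. The module name, head count, and hidden size are illustrative assumptions rather than values from the thesis, and the sketch pools a weighted mean and standard deviation in place of the raw variance:

```python
import torch
import torch.nn as nn

class MultiHeadAttentiveStatsPooling(nn.Module):
    """Multi-head attentive statistics pooling (illustrative sketch).

    An MLP scores each frame per head; the scores weight the mean and
    standard-deviation statistics, which are concatenated across heads.
    """
    def __init__(self, feat_dim: int, num_heads: int = 4, hidden_dim: int = 128):
        super().__init__()
        # Per-frame scorer: one attention weight per head for every frame.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_heads),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) frame-level TDNN outputs
        attn = torch.softmax(self.mlp(x), dim=1)   # (B, T, H), normalized over frames
        attn = attn.unsqueeze(-1)                  # (B, T, H, 1)
        frames = x.unsqueeze(2)                    # (B, T, 1, D)
        mean = (attn * frames).sum(dim=1)          # (B, H, D) weighted mean per head
        var = (attn * (frames - mean.unsqueeze(1)) ** 2).sum(dim=1)
        std = (var + 1e-8).sqrt()                  # weighted std per head
        # Concatenate per-head statistics into one utterance-level vector.
        b = x.size(0)
        return torch.cat([mean.reshape(b, -1), std.reshape(b, -1)], dim=1)
```

Applied to frame-level TDNN outputs of shape (batch, frames, feat_dim), this yields one fixed-length utterance vector per recording, which plays the role of the x-vector here.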
To address the insufficient inter-class separation and the large intra-class dispersion of the speaker embeddings learned with the Softmax cross-entropy loss, this thesis proposes two improvements to the loss function. The first directly improves the original Softmax by introducing AM-Softmax, which redefines the classification boundary by adding an additive angular margin to the original Softmax, increasing the inter-class distance while reducing the intra-class distance. The second introduces an auxiliary loss, the center loss, and jointly supervises network training with the Softmax cross-entropy loss and the center loss: while the Softmax loss learns inter-class differences, the center loss compresses intra-class distances and thereby optimizes the feature space. Experiments found that the speaker feature extraction network trained with AM-Softmax converges poorly, because the angular margin makes the speaker classification task harder. To ensure normal convergence of the loss, hyperparameters are introduced to simplify AM-Softmax training. The results show that the simplified AM-Softmax converges well and that, with an appropriate optimizer learning rate, system performance is optimal: when the LDA dimensionality is 512, the average improvement over the length-normalized x-vector baseline is 1.08%. The center loss method is then verified; with a center loss weight of 0.005 and an LDA dimensionality of 256, the average improvement over the length-normalized x-vector baseline is 0.8%.

To address the error accumulation caused by the locally optimal decisions that the Agglomerative Hierarchical Clustering (AHC) algorithm makes during iterative clustering, this thesis redefines the speaker clustering task from the perspective of global optimization and conducts comparative studies from two angles: one treats speaker clustering as an integer linear programming problem and obtains the globally optimal solution by minimizing an objective function; the other treats it as an optimal graph partitioning problem from the perspective of graph theory. First, the clustering algorithm based on Hierarchical Integer Linear Programming (HILP) is analyzed, and the x-vector is combined with the HILP algorithm to solve the speaker clustering task of the DIHARD challenge, which spans multiple audio domains. Second, spectral clustering based on a Probabilistic Linear Discriminant Analysis (PLDA) score matrix and on a cosine similarity matrix is studied. Experimental results show that combining the HILP-based clustering algorithm with x-vectors greatly improves system performance, by 3.74% on average over the baseline system, whereas spectral clustering based on the two kinds of score matrices degrades system performance because the similarity matrices are not constructed reasonably enough.

Finally, from the perspective of overall system optimization, a speaker diarization system combining the dual self-attention mechanism, the improved loss function, and HILP is constructed: the speaker feature extraction network adopts the dual self-attention framework, the network is trained with the AM-Softmax loss, and clustering uses the HILP-based algorithm. Experimental results show that the joint system effectively integrates the advantages of the individual improvements and greatly improves system performance: when the LDA dimensionality is 512, it outperforms the length-normalized x-vector baseline system by 2.2% on average.
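For reference, AM-Softmax, the loss adopted in the joint system above, can be sketched in a few lines of PyTorch. The scale and margin values below are typical choices assumed for illustration (the thesis's simplified training hyperparameters are not reproduced here), and the class name is hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive-margin Softmax loss (illustrative sketch).

    Logits are cosine similarities between L2-normalized embeddings and
    class weights; the margin m is subtracted from the target-class cosine
    before scaling by s, which pushes classes apart and tightens each class.
    """
    def __init__(self, embed_dim: int, num_speakers: int,
                 scale: float = 30.0, margin: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_speakers, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale = scale    # s: sharpness of the softmax distribution
        self.margin = margin  # m: additive margin on the target cosine

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the margin only at the ground-truth class position.
        one_hot = F.one_hot(labels, cosine.size(1)).to(cosine.dtype)
        logits = self.scale * (cosine - self.margin * one_hot)
        return F.cross_entropy(logits, labels)
```

Because both the embeddings and the class weights are L2-normalized, the logits are cosine similarities; subtracting m only at the target class forces each speaker's embeddings to beat every other class by a fixed cosine margin, which is what increases inter-class distance while reducing intra-class distance.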
Keywords/Search Tags:speaker diarization, multi-head self-attention mechanism, AM-Softmax, center loss, hierarchical integer linear programming