
Research On Voiceprint Signal Clustering Algorithm For Speaker Segmentation

Posted on: 2024-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: X L Zhang
GTID: 2568307151459714
Subject: Control Science and Engineering
Abstract/Summary:
As an important topic in the voiceprint field, speaker diarization determines the start and end times of each speaker in a continuous audio stream containing multiple speakers, thereby "purifying" the stream so that speech recognition systems perform better. In recent years, with the widespread application of deep learning to voiceprint processing, speaker segmentation and clustering technology has developed rapidly. However, background noise, reverberation, channel differences, and other factors keep the error rate of speaker segmentation and clustering high and the robustness of the overall system poor, so application requirements are not met. This thesis therefore studies asynchronous voiceprint segmentation and clustering in conference scenarios in depth, improving both the mainstream voiceprint model ECAPA-TDNN and the clustering algorithms. The main research contents are as follows:

(1) A multi-scale channel-separation convolution module based on a parallel attention mechanism is proposed. It slices and splices the channel features in the HS-Res2Net module in a fully connected manner, so that non-adjacent channel features can be mapped directly to one another and multi-scale voiceprint features can be represented at a finer granularity. The multi-scale features are then expressed through parallel spatial and channel attention: while the key information in the channel features is captured, weights are also established by spatial position so that local features receive more attention, which strengthens feature representation.

(2) The AAM-Softmax loss function gains relative intra-class compactness and inter-class dispersion by adding a fixed angular margin m, which increases training difficulty, but it cannot adapt the classification boundary dynamically. This thesis therefore uses the opposite behavior of the sine and cosine functions to add cosine margin compensation to the AAM-Softmax loss, adjusting the inter-class spacing according to the similarity of feature samples on the original classification boundary. This better contrasts the speech characteristics specific to different speakers and improves the robustness of speaker segmentation and clustering systems in complex scenarios.

(3) In voiceprint spectral clustering, feature similarity is computed from the cosine angle while the absolute distance is ignored. To address this, a bidirectional cosine similarity is proposed that measures the similarity between voiceprint feature objects from two perspectives, direction and magnitude. It retains the discriminative advantage of direction and angle while accounting for distance differences in each dimension of the data points. The spectral clustering algorithm optimized in this way further improves the performance of the voiceprint segmentation and clustering system.

(4) Spectral clustering relies heavily on the similarity matrix. To address this, a Markov clustering algorithm with a traction factor is proposed to reconstruct the similarity matrix. Similarities are converted into probability values, and the initial probability matrix and the preceding state probability matrix act as traction that regulates and constrains the current state probabilities, preventing them from wandering into the "wrong class". Expansion and inflation operations are applied repeatedly until the transition probability matrix stabilizes, and spectral clustering is then run on the reconstructed, strongly discriminative probabilistic similarity matrix. This drives the algorithm toward the global optimum and effectively reduces the voiceprint segmentation and clustering error rate.
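
Item (2) starts from AAM-Softmax with a fixed angular margin m. As a rough illustration of that baseline only, the sketch below computes the loss for a single sample in plain Python. It does not implement the cosine margin compensation proposed in the thesis, whose exact form is not given in this abstract. The function name `aam_softmax_loss` and the default `margin` and `scale` values are assumptions chosen for illustration.

```python
import math


def aam_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """Loss for one sample under a fixed additive angular margin (baseline sketch).

    cosines: cosine similarity between the embedding and each class weight.
    target:  index of the true speaker class.
    margin, scale: illustrative defaults, not values taken from the thesis.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # keep acos in its valid domain
        if j == target:
            # Widen the target-class angle by the fixed margin m.
            # Handling of theta + m > pi is omitted in this sketch.
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    # Numerically stable -log softmax of the target logit.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


if __name__ == "__main__":
    sims = [0.8, 0.1, -0.3]
    print(aam_softmax_loss(sims, target=0, margin=0.0, scale=10.0))
    print(aam_softmax_loss(sims, target=0, margin=0.2, scale=10.0))
```

With `margin=0.0` the function reduces to ordinary softmax cross-entropy on scaled cosines. A positive margin raises the loss for the same input, which is the added training difficulty the abstract refers to.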
Keywords/Search Tags: Speaker Diarization, Attention mechanism, AAMC-Softmax, Bidirectional cosine similarity, Traction factor