Research on a Speaker Diarization System Based on Bayesian Methods

Posted on: 2021-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: X H Zeng
GTID: 2518306302954129
Subject: Applied Statistics
Abstract/Summary:
At present, the raw recording data collected from a machine or from the network is just an unstructured binary data stream. The text obtained through speech recognition contains only the content of the speech; it lacks the identity of the speaker who produced that content. This restricts downstream tasks such as semantic understanding, role analysis, and voice archiving. Speaker diarization is a technology developed to complement other speech technologies: it segments the original recording and clusters the segments into several categories, where each category contains all the speech of a single speaker. This thesis focuses on the two most critical subtasks in speaker diarization, speaker change detection and speaker clustering, as described below.

Data preprocessing: This thesis uses all the voice data of the open-source AMI Meeting Corpus together with its corresponding transcript labels. All recordings are cut uniformly into frames of 380 ms, MFCC features are extracted from each frame, and the speaker label corresponding to each frame is obtained. Finally, the generalized end-to-end loss for speaker verification (GE2E) model, published by Google researchers in 2018, is implemented to obtain a vector embedding of each frame of speech.

Speaker change detection: This is a sequence labeling task that determines whether two adjacent frames of speech come from the same speaker. The GE2E model is used to obtain the vector embedding of each frame. The thesis analyzes how the decision threshold affects the accuracy on pairs where the speaker changes and the accuracy on pairs where the speaker does not change; when detecting changes is the priority, the threshold can be lowered accordingly. The thesis also explores the frequency distribution of speech length and finds that it has the shape of an inverse-proportional function. The speaker change detection task is then reformulated with Bayes' formula, moving from a decision that uses only the speech features of a sample to one that uses both the speech features and the length of the speech. In the experiments, this distribution is added to the speaker change detection task as prior information. The accuracy of speaker change detection improves from 80.78% to 84.98%, while the accuracy on non-change pairs decreases slightly from 83.66% to 81.04%. Overall, introducing the prior distribution of speech length as auxiliary information improves performance on this task.

Speaker clustering: The task is to group a set of speech segments into several clusters such that all the segments in each cluster are produced by a single speaker. The thesis first introduces the principle of spectral clustering and then, using a similarity matrix generated from the GE2E embeddings above, applies spectral clustering as the baseline for speaker clustering. The results show that the main factor limiting accuracy is that the predicted number of speakers is too large, which leads to a high diarization error rate of 37.3%. After exploring the distributions of speech length, number of speakers, and number of speaker changes, the thesis finds that the number of speakers in a given context follows a regular pattern: in a conference or conversation, the probability of a large number of speakers within a fixed period of time is very small. Adding these three sample distributions as prior information to the speaker clustering task reduces the diarization error rate from 37.3% to 7.2%, which shows that introducing the distribution of the number of speakers helps improve the accuracy of speaker clustering.

In summary, this thesis finds that introducing speech length information into the speaker change detection task and the number of speakers into the speaker clustering task improves the accuracy of both tasks. More importantly, these two improvements are independent of any specific model: whatever model is used to implement speaker diarization, prior information about the samples can be introduced through Bayes' formula.
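
To make the idea of adding a speech-length prior through Bayes' formula more concrete, here is a minimal Python sketch. It is not the thesis's implementation. The abstract does not give the exact formulation, so everything below is an assumption for illustration: the prior p(L) proportional to 1/L over speech length in frames, the Gaussian likelihoods of the cosine similarity under "same speaker" and "different speaker" (with made-up means and standard deviations), the 0.5 decision threshold, and all function names.

```python
import math

def length_pmf(max_len):
    """Assumed prior over speech length L (in frames): p(L) proportional to 1/L."""
    weights = [1.0 / n for n in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def hazard(pmf):
    """h[n-1] = P(L = n | L >= n): prior probability of a change after n frames."""
    out = [0.0] * len(pmf)
    tail = 0.0  # running value of P(L >= n), accumulated from the end
    for i in range(len(pmf) - 1, -1, -1):
        tail += pmf[i]
        out[i] = pmf[i] / tail
    return out

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def change_posterior(sim, run_length, haz, same=(0.8, 0.15), diff=(0.3, 0.2)):
    """Bayes' formula: P(change | similarity, current run length).

    `same` and `diff` are hypothetical (mean, std) pairs for the similarity
    of adjacent frames from the same speaker and from different speakers.
    """
    prior = haz[min(run_length, len(haz)) - 1]
    like_change = gaussian_pdf(sim, *diff)
    like_same = gaussian_pdf(sim, *same)
    numerator = like_change * prior
    return numerator / (numerator + like_same * (1.0 - prior))

def detect_changes(embeddings, haz, threshold=0.5):
    """Label each adjacent pair of frame embeddings as change / no change."""
    flags = []
    run_length = 1  # frames observed since the last detected change
    for prev, cur in zip(embeddings, embeddings[1:]):
        p = change_posterior(cosine(prev, cur), run_length, haz)
        changed = p > threshold
        flags.append(changed)
        run_length = 1 if changed else run_length + 1
    return flags

haz = hazard(length_pmf(50))
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(detect_changes(frames, haz))
```

The design choice in this sketch is that the length prior enters only through the prior probability of a change at the current run length, so the embedding model supplying the similarities can be swapped out without touching the Bayesian step. This mirrors the model-independence claim in the summary above.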
Keywords/Search Tags: speaker diarization, speaker change detection, speaker clustering, spectral clustering