
Research On Speaker Diarization In Multi-person Scenarios

Posted on: 2024-09-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Du
Full Text: PDF
GTID: 2568307076996939
Subject: Control Science and Engineering (Pattern Recognition and Intelligent Systems)
Abstract/Summary:
Speaker diarization divides an audio stream into segments containing a single speaker's voice and labels them with speaker identities, solving the problem of "who spoke when." Because it enables the extraction of high-level abstract information from audio, efficient management of speech data, and effective indexing and querying, speaker diarization has attracted increasing research attention. In practice, small-scale meetings account for a large proportion of use cases. Most researchers who use a microphone array as the recording device ignore its spatial information, yet exploiting that spatial information can effectively improve system performance. The Augmented Multi-party Interaction (AMI) meeting dataset suffers from noise and a large number of short speech segments, which lead to high error rates and low robustness in speaker diarization and prevent the results from meeting the needs of industrial applications. This thesis studies speaker diarization combined with microphone-array spatial information in depth, constructs an internal dataset that simulates multi-person scenarios with many short speech segments, and offers new ideas and methods for the speaker diarization task. The specific research results are as follows:

(1) To improve the performance of the speaker diarization system by exploiting information about the number of speakers, this thesis proposes a method that uses the spatial information of the microphone array to form a speaker location chart. First, the spatial information of speech segments is obtained with the steered response power phase transform (SRP-PHAT) algorithm. Then, to counter the influence of noise on sound source localization, a speech enhancement algorithm is applied to the microphone-array audio. Finally, the speaker location chart is constructed from the spatial information, and the number of speakers in the meeting is read off the chart to limit the number of clusters. Experimental results show that the proposed method consistently outperforms previously published results on this dataset; compared with the current best system, the speaker diarization error rate is reduced by more than 26%.

(2) This thesis proposes a speaker diarization method that relies on microphone-array spatial information and the speaker location chart rather than a neural-network speaker model, performing diarization without the traditional modular pipeline. First, the speaker location chart is obtained and the number of speakers is estimated with the method proposed in (1). Then, after the angle corresponding to each peak in the chart is determined, the average angle between each pair of peaks is calculated. Finally, the time points belonging to each speaker region are extracted and marked as the same speaker, and the speaker labels of all time segments are merged into the diarization output file. Experimental results show that the proposed method is simpler than traditional modular methods and outperforms the neural-network-based d-vector method, with a relative reduction in speaker diarization error rate of more than 34%. System performance improves further once speech enhancement algorithms are introduced.

(3) To address the low recognition rate for short speech segments in multi-speaker diarization, an internal dataset is constructed to simulate scenarios with many short speech segments, and a method is proposed that combines the spatial information of the microphone array with normalized maximum eigengap spectral clustering. Speech segments longer than 0.6 seconds are clustered with a neural-network speaker model, which extracts speaker embeddings for clustering; segments shorter than 0.6 seconds are clustered with the spatial-information-based diarization method. The speaker labels produced by the two methods are aligned with a bipartite-graph maximum matching algorithm to resolve label ambiguity. Using the spatial information of the microphone array together with the spectral clustering results, the method re-recognizes the speakers of short segments in the initial clustering output via sound source localization, effectively addressing the low recognition rate of short segments in meetings. Experimental results show that the proposed method reduces the speaker diarization error rate by more than 47% compared with the baseline system.
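The speaker-location-chart idea in (1) can be sketched as follows. This is a minimal illustration that assumes SRP-PHAT has already produced one direction-of-arrival angle per speech segment; the bin width, the peak threshold, and the function names are hypothetical choices, not the thesis's actual parameters:

```python
def location_chart(segment_angles, bin_width=5, max_angle=360):
    """Accumulate per-segment DOA angles (degrees) into a circular histogram,
    i.e. a 'speaker location chart'."""
    nbins = max_angle // bin_width
    chart = [0] * nbins
    for angle in segment_angles:
        chart[int(angle % max_angle) // bin_width] += 1
    return chart

def count_speakers(chart, min_count=2):
    """Estimate the number of speakers as the number of local maxima in the
    chart that clear a minimum-support threshold; neighbors wrap circularly."""
    n = len(chart)
    peaks = 0
    for i, v in enumerate(chart):
        if v >= min_count and v > chart[(i - 1) % n] and v >= chart[(i + 1) % n]:
            peaks += 1
    return peaks

# Example: segments localized around roughly 40° and 200° suggest two speakers.
angles = [38, 40, 41, 42, 199, 200, 201, 202, 40]
num_speakers = count_speakers(location_chart(angles))  # → 2
```

The estimated count would then bound the number of clusters passed to the downstream clustering stage.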
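The chart-based labeling step in (2) amounts to assigning each time segment to the nearest peak angle. A minimal sketch, assuming peak angles have already been extracted from the chart (names and the circular-distance convention are illustrative):

```python
def circular_distance(a, b):
    """Smallest angular difference between two angles in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def assign_speakers(segment_angles, peak_angles):
    """Label each segment with the index of the closest chart peak,
    so segments localized near the same peak share a speaker label."""
    labels = []
    for angle in segment_angles:
        dists = [circular_distance(angle, p) for p in peak_angles]
        labels.append(dists.index(min(dists)))
    return labels

# Example: with peaks at 40° and 120°, a segment at 350° is still closest
# to the 40° peak (50° away circularly, versus 130° to the other peak).
labels = assign_speakers([38, 118, 350], [40, 120])  # → [0, 1, 0]
```

No speaker model is trained here; the spatial geometry alone separates the speakers, which is what lets the method skip neural-network speaker modeling.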
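The label-alignment step in (3) can be sketched as a maximum-weight bipartite matching between the two label sets. For meeting-scale speaker counts a brute-force search over permutations suffices; the overlap-matrix representation and function name below are assumptions for illustration, and a production system could instead use an assignment solver such as the Hungarian algorithm:

```python
from itertools import permutations

def align_labels(overlap):
    """Map spatial-method labels to embedding-method labels.

    overlap[i][j] = amount of time where spatial label i co-occurs with
    embedding label j. Returns the one-to-one mapping {i: j} that maximizes
    total co-occurrence (assumes a square matrix, i.e. equal label counts).
    """
    n = len(overlap)
    best_score, best_perm = -1, None
    for perm in permutations(range(n)):
        score = sum(overlap[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return {i: best_perm[i] for i in range(n)}

# Example: spatial label 0 mostly co-occurs with embedding label 1 and
# vice versa, so the two label sets are cross-mapped.
mapping = align_labels([[1, 8], [9, 2]])  # → {0: 1, 1: 0}
```

Once the mapping is fixed, short-segment labels from the spatial method and long-segment labels from the embedding clustering refer to the same speakers and can be merged into one diarization output.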
Keywords/Search Tags: speaker diarization, beamforming, speech enhancement, sound source localization, speaker clustering