Speaker diarization is a branch of speech signal processing that answers the question of "who spoke when". As a front-end speech processing technology, speaker diarization supports speech recognition, speech separation, and other tasks, and it is of great significance for research on automatic speech recognition and content-based audio classification. At present, there are two kinds of speaker diarization systems: clustering-based speaker diarization and end-to-end neural speaker diarization. The clustering-based system has long been the mainstream framework, but it consists of multiple modules, its pipeline is relatively complicated, and it handles overlapping speech poorly, which greatly limits its application in everyday environments. The end-to-end neural method recasts speaker diarization as a multi-label classification problem, in which the individual sub-modules of a traditional system are replaced by a single neural network that determines, directly from the input features, which speaker or speakers are active at each moment. This research is based on the end-to-end neural method. The main contributions are as follows:

(1) Implemented a single-channel speaker diarization system, improving feature extraction by replacing the Mel filter bank with wav2vec 2.0, using a bidirectional long short-term memory network (BLSTM) for sequence modeling, and training the model on a 50-hour dataset. Experimental results show that the single-channel system still achieves good results in the presence of noise and reverberation.

(2) Traditional spatial feature extraction is generally based on phase information from the Fourier transform, such as the generalized cross-correlation phase transform (GCC-PHAT) and the inter-channel phase difference (IPD), which may not be optimal in a deep learning framework. A multi-channel speaker diarization system that combines spatial and acoustic features is therefore proposed. A 2-D convolution (conv2d) extracts spatial information from the signals of each microphone pair, while the wav2vec 2.0 model extracts acoustic features; the spatial and acoustic features are concatenated as the input of the diarization model. Experiments show that adding spatial information significantly improves the performance of the diarization system.
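The multi-label view of end-to-end diarization described above can be illustrated with a minimal numpy sketch. This is not the thesis's implementation: it assumes a network has already produced per-frame posteriors p[t, s] for S speakers, and shows only the final decision step, where thresholding each entry independently lets several speakers be active in the same frame (overlapping speech).

```python
import numpy as np

def posteriors_to_labels(posteriors, threshold=0.5):
    """Binarize a (T, S) matrix of per-frame speaker posteriors into
    multi-label activity decisions; each speaker is decided independently,
    so overlapping speech is represented naturally."""
    return (posteriors >= threshold).astype(int)

# Hypothetical posteriors for 3 frames and 2 speakers.
p = np.array([[0.9, 0.1],   # frame 0: only speaker 0 active
              [0.8, 0.7],   # frame 1: both speakers (overlap)
              [0.2, 0.6]])  # frame 2: only speaker 1 active
labels = posteriors_to_labels(p)
```

Because each column is thresholded on its own (rather than picking a single most likely speaker per frame, as a single-label classifier would), frame 1 is correctly labeled as overlapped speech.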
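The GCC-PHAT feature cited as a traditional spatial baseline can be sketched as follows. This is a generic numpy implementation of the standard technique, not code from the thesis; the function name and the two-microphone setup are assumptions for illustration. It whitens the cross-power spectrum of a microphone pair so that only phase (i.e., time-delay) information remains, then reads the delay off the peak of the resulting cross-correlation.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the delay (in seconds) of `sig` relative to `ref`
    using the generalized cross-correlation phase transform (GCC-PHAT)."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15            # PHAT weighting: discard magnitude, keep phase
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # center zero lag
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

In a multi-channel diarization front end, one such delay (or the full GCC-PHAT correlation vector) would be computed per microphone pair and concatenated with the acoustic features; the thesis's point is that a learned conv2d extractor can replace this hand-crafted step.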