Speaker diarization is a branch of speech signal processing that answers the question of "who spoke when". As a front-end speech processing technology, speaker diarization supports speech recognition, speech separation, and other tasks, and it is of great significance for research on automatic speech recognition and content-based audio classification. At present, there are two kinds of speaker diarization systems: clustering-based speaker diarization and end-to-end neural speaker diarization. The clustering-based system has long been the mainstream framework, but it consists of multiple modules, its pipeline is relatively complicated, and it handles overlapping speech poorly, which greatly limits its application in everyday environments. The end-to-end neural method recasts speaker diarization as a multi-label classification problem, in which the individual sub-modules of a traditional system are replaced by a single neural network that determines, directly from the input features, which speaker or speakers are active at each moment. This research is based on the end-to-end neural method. The main contributions are as follows:

(1) Implemented a single-channel speaker diarization system, improving feature extraction by replacing the Mel filter bank with wav2vec 2.0, using a bidirectional long short-term memory network (BLSTM) for sequence modeling, and training the model on a 50-hour dataset. Experimental results show that the single-channel system still achieves good results in the presence of noise and reverberation.

(2) Traditional spatial feature extraction is generally based on phase information from the Fourier transform, such as the generalized cross-correlation phase transform (GCC-PHAT) and the inter-channel phase difference (IPD), which may not be optimal in a deep learning framework. A multi-channel speaker diarization system that combines spatial and acoustic features is therefore proposed. A 2-D convolution (conv2d) extracts spatial information from the signals of each microphone pair, while the wav2vec 2.0 model extracts acoustic features; the spatial and acoustic features are concatenated as the input of the diarization model. Experiments show that adding spatial information significantly improves the performance of the diarization system.
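The multi-label view of end-to-end diarization described above can be illustrated with a minimal numpy sketch. This is not the thesis's implementation: it assumes a network has already produced per-frame posteriors p[t, s] for S speakers, and shows only the final decision step, where thresholding each entry independently lets several speakers be active in the same frame (overlapping speech).

```python
import numpy as np

def posteriors_to_labels(posteriors, threshold=0.5):
    """Binarize a (T, S) matrix of per-frame speaker posteriors into
    multi-label activity decisions; each speaker is decided independently,
    so overlapping speech is represented naturally."""
    return (posteriors >= threshold).astype(int)

# Hypothetical posteriors for 3 frames and 2 speakers.
p = np.array([[0.9, 0.1],   # frame 0: only speaker 0 active
              [0.8, 0.7],   # frame 1: both speakers (overlap)
              [0.2, 0.6]])  # frame 2: only speaker 1 active
labels = posteriors_to_labels(p)
```

Because each column is thresholded on its own (rather than picking a single most likely speaker per frame, as a single-label classifier would), frame 1 is correctly labeled as overlapped speech.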
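The GCC-PHAT feature cited as a traditional spatial baseline can be sketched as follows. This is a generic numpy implementation of the standard technique, not code from the thesis; the function name and the two-microphone setup are assumptions for illustration. It whitens the cross-power spectrum of a microphone pair so that only phase (i.e., time-delay) information remains, then reads the delay off the peak of the resulting cross-correlation.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the delay (in seconds) of `sig` relative to `ref`
    using the generalized cross-correlation phase transform (GCC-PHAT)."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15            # PHAT weighting: discard magnitude, keep phase
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # center zero lag
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

In a multi-channel diarization front end, one such delay (or the full GCC-PHAT correlation vector) would be computed per microphone pair and concatenated with the acoustic features; the thesis's point is that a learned conv2d extractor can replace this hand-crafted step.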