
Research on End-to-End Tibetan Speaker Diarization Based on Deep Learning

Posted on: 2024-01-25 | Degree: Master | Type: Thesis
Country: China | Candidate: H W Guo | Full Text: PDF
GTID: 2555307124964159 | Subject: Engineering
Abstract/Summary:
Speaker diarization is the task of determining "who is speaking at what time", and it is an important front-end processing technology for speech recognition. In recent years, speaker diarization has progressed rapidly as deep learning has driven revolutionary changes in research and practice across speech applications, and good results have been achieved in widely studied languages such as English and Chinese. However, there are few studies on speaker diarization for Tibetan, a minority language. This thesis proposes end-to-end Tibetan speaker diarization based on deep learning, using two methods: one based on BLSTM and one based on the self-attention mechanism. Experiments show that the self-attention-based method generally outperforms the BLSTM-based method: the self-attention model achieves a diarization error rate (DER) of 16.29% on the simulated mixed Tibetan dataset and a DER of 38.68% on the real recorded Tibetan dialogue dataset. The main research work of this thesis is as follows:

1. Design of a Tibetan speaker diarization corpus. The direction of the Tibetan corpus recording was first determined by analyzing the historical development of Tibetan dialects. Next, Tibetan speaker labels were constructed by analyzing the acoustic characteristics of Tibetan. On this basis, two-person Tibetan conversation recordings were collected and then converted to the format required by the experiments. Two Tibetan datasets are used: a simulated mixed Tibetan dataset with a total duration of 11 hours, referred to as Simulated-T, and a real recorded Tibetan conversation dataset with a total duration of 6 hours, referred to as Real Tibetan.

2. An end-to-end Tibetan speaker diarization architecture. Compared with modular speaker diarization architectures, the advantage of the end-to-end architecture is that a single neural network directly optimizes for, and reduces, the diarization error, without a clustering algorithm. It formulates speaker diarization as a multi-label classification task, which allows the end-to-end method to handle overlapped Tibetan speech.

3. A BLSTM-based end-to-end Tibetan speaker diarization method. Speech is first preprocessed to extract features, yielding feature sequences. These sequences are fed into a single neural network that directly outputs frame-wise speaker labels. To resolve the ambiguity of speaker label ordering during training, a permutation-invariant training (PIT) loss and a deep clustering (DPCL) loss are introduced; a sketch of the PIT idea is given below. Experiments with different speech overlap rates show that the higher the Tibetan speech overlap rate, the lower the DER and the better the result. The BLSTM model achieves good results on the simulated mixed Tibetan dataset: with β = 2 and a training set of 10,000 utterances, it reaches a DER of 15.33%.
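To make the permutation-invariant objective concrete, here is a minimal PyTorch sketch for the frame-wise multi-label setting. The function name pit_bce_loss, the tensor shapes, and the use of binary cross-entropy are illustrative assumptions, not the thesis's actual implementation; the DPCL loss the thesis combines with PIT is omitted.

```python
import itertools

import torch
import torch.nn.functional as F

def pit_bce_loss(logits, labels):
    """Permutation-invariant BCE loss (hypothetical sketch).

    logits: (T, S) frame-wise speaker-activity logits from the network
    labels: (T, S) float tensor of 0/1 speaker-activity targets

    Because the ordering of the S speaker columns in the targets is
    arbitrary, the loss is evaluated for every permutation of those
    columns and the minimum is kept.
    """
    n_speakers = labels.shape[1]
    losses = [
        F.binary_cross_entropy_with_logits(logits, labels[:, list(perm)])
        for perm in itertools.permutations(range(n_speakers))
    ]
    return torch.stack(losses).min()
```

For two speakers only two permutations are evaluated; the factorial cost of this search is one reason end-to-end diarization of this kind is usually trained with a small, fixed number of speakers.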
4. An end-to-end Tibetan speaker diarization method based on the self-attention mechanism. Since a BLSTM conditions each frame only on its previous and next hidden states, this thesis proposes a self-attention-based method in which each frame is conditioned on all other input frames by computing pairwise similarities between frames. This thesis argues that this mechanism is key to separating Tibetan speakers, as it can capture global speaker characteristics in addition to local voice activity dynamics; the model can also handle overlapped Tibetan speech. The self-attention model outperforms the BLSTM model when the training set is large, and on the real recorded Tibetan dialogue dataset it performs far better than the BLSTM model, reaching a DER of 38.68% and generalizing better. A sketch of such a model is given below.
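As a rough illustration of the self-attention method described in item 4, the following is a minimal PyTorch sketch of a Transformer-encoder diarization backbone with per-frame sigmoid outputs. The class name, feature dimension, and all layer sizes are illustrative assumptions rather than the thesis's configuration.

```python
import torch
import torch.nn as nn

class SelfAttentionDiarizer(nn.Module):
    """Hypothetical sketch: every frame attends to every other frame,
    and each frame receives independent per-speaker activity
    probabilities, so overlapped speech maps to multiple active labels."""

    def __init__(self, n_features=345, d_model=256, n_heads=4,
                 n_layers=4, n_speakers=2):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_speakers)

    def forward(self, x):  # x: (batch, frames, n_features)
        # Self-attention computes pairwise similarities between frames,
        # so each frame is conditioned on all other input frames.
        h = self.encoder(self.proj(x))
        return torch.sigmoid(self.head(h))  # (batch, frames, n_speakers)
```

At inference time, thresholding these per-frame probabilities (e.g. at 0.5) yields speaker activity directly, without a separate clustering step; this is the multi-label formulation described in item 2.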
Keywords/Search Tags: Speaker diarization, Tibetan, BLSTM, Self-attention, Deep learning