Voice Activity Detection Based On Deep Learning

Posted on: 2021-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: T J Xu
GTID: 2428330620476428
Subject: Computer Science and Technology
Abstract/Summary:
Speech is the most natural medium for human-to-human and human-to-machine interaction. In real environments, noise is always present; it degrades the quality of communication and the performance of automatic speech recognition (ASR) and speaker recognition systems. Voice activity detection (VAD) aims to detect the presence of speech in a noisy environment. It is typically used as a pre-processing step and plays an important role in ASR systems. This dissertation studies the principles of voice activity detection methods, with a focus on deep learning based methods. Building on these methods, three supervised learning algorithms for VAD are proposed:

1. A two-stage training method based on the convolutional, long short-term memory, deep neural network (CLDNN). CLDNN is a state-of-the-art deep learning model for VAD. This work analyzes its structural characteristics and proposes a two-stage training method, with a non-sequential stage followed by a sequential stage, to improve data utilization.

2. A joint training method that trains VAD together with the related speech enhancement task and uses enhanced self-encoding (autoencoder) features as auxiliary features. This work analyzes three types of joint training algorithms for VAD and speech enhancement, improves the performance of joint training by means of the self-encoding auxiliary features, and extends the joint methods. An algorithm that automatically adjusts the weight hyperparameters is then proposed.

3. An improved VAD algorithm based on the likelihood ratio test. Likelihood ratio test based VAD has two defects: the parameter estimates are not accurate, and the decision threshold has to be set manually. To address both problems, an algorithm that combines statistical signal processing with deep learning is designed. It uses time-frequency masking to estimate the parameters and computes the threshold dynamically with a global average pooling (GAP) layer. Systematic experiments show that each of the two parts of the proposed method on its own effectively improves the performance of the traditional signal processing baseline system. Compared with an end-to-end deep learning method of similar model size, the proposed method has clear advantages.
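For orientation, the traditional likelihood ratio test baseline mentioned in the third contribution can be sketched as follows. This is a minimal illustration of the classical statistical-model approach, not the thesis's algorithm: it assumes complex Gaussian spectral coefficients, a noise power estimate that is already known, a maximum-likelihood a priori SNR estimate, and a fixed hand-set threshold. The function name `lrt_vad` and its parameters are hypothetical. The thesis instead estimates the parameters with time-frequency masking and derives the threshold from a GAP layer, neither of which is shown here.

```python
import numpy as np

def lrt_vad(power_spec, noise_psd, threshold=0.5):
    """Frame-level likelihood ratio test VAD (illustrative sketch).

    power_spec: array (frames, bins) of |X|^2 for the noisy signal.
    noise_psd:  array (bins,) with an assumed noise power estimate.
    threshold:  fixed, manually chosen decision threshold (assumption).
    """
    eps = 1e-12
    # A posteriori SNR per time-frequency bin.
    gamma = power_spec / (noise_psd + eps)
    # Maximum-likelihood estimate of the a priori SNR, clipped at zero.
    xi = np.maximum(gamma - 1.0, 0.0)
    # Per-bin log likelihood ratio under the Gaussian model.
    log_lr = gamma * xi / (1.0 + xi) - np.log1p(xi)
    # Average over frequency bins to get one score per frame.
    frame_llr = log_lr.mean(axis=1)
    return frame_llr > threshold, frame_llr
```

Both weaknesses named in the abstract are visible in this sketch: `noise_psd` and `xi` are crude estimates, and `threshold` is a constant that has to be tuned by hand.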
Keywords/Search Tags: voice activity detection, deep learning, training strategy, joint training, likelihood ratio test