Speech Endpoint Detection Based On Statistical Models

Posted on: 2018-06-22  Degree: Master  Type: Thesis
Country: China  Candidate: H R Wei  Full Text: PDF
GTID: 2358330515481565  Subject: Communication and Information System
Abstract/Summary:
Voice activity detection (VAD) aims to separate speech segments from non-speech segments or background noise in an audio signal. It is an important front-end step in most state-of-the-art speech processing applications, such as speech recognition, speaker recognition, and speech transmission. VAD has been studied for many decades, and energy-based VAD is the most commonly used approach. Energy-based VAD performs well in noise-free environments but deteriorates in noisy ones. Self-adaptive VAD outperforms traditional energy-based VAD in many respects. However, its single minimum energy threshold does not perform well across different channel conditions or background noises. In this work, we make several improvements to self-adaptive VAD to address this issue and improve detection performance. A k-means-based average energy clustering approach is proposed to find a better minimum energy threshold for each speech recording. In the VAD decision phase, the new threshold is used in the likelihood ratio test. Furthermore, applying median filtering as a post-processing step of self-adaptive VAD smooths the VAD errors caused by short-time noise and yields better results. Experimental results on a subset of the NIST 2006 speaker recognition evaluation (SRE) dataset show that the proposed method outperforms both the traditional energy-based and the self-adaptive VAD approaches.

Recently, VAD algorithms based on deep neural networks (DNNs) have attracted growing attention from researchers because of their outstanding performance. In this work, several improvements to the classical DNN-based VAD are proposed. First, spectral subtraction is used to improve performance in low signal-to-noise ratio environments. Then, a self-adaptive median filtering method is proposed to smooth short-time noise. Furthermore, a supervised learning rule similar to the human "easy things first" learning rule is proposed; using this rule, neural network training can be accelerated. Experimental results on a subset of the AURORA2 dataset show that the proposed VAD using spectral subtraction and self-adaptive median filtering achieves a 31.12% relative performance improvement, and that the supervised learning rule indeed speeds up training.
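The energy-clustering and median-filtering ideas described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not give the clustering details or the likelihood ratio test, so the sketch assumes a two-cluster 1-D k-means on frame log energies, takes the midpoint between the two centroids as the per-recording threshold, and replaces the likelihood ratio test with a plain threshold comparison. The function names, frame length, hop size, and filter width are all hypothetical choices made for illustration.

```python
import numpy as np


def frame_log_energy(signal, frame_len=200, hop=80):
    """Log energy (dB) of each frame; frame_len and hop are assumed values."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        # Small constant avoids log(0) on silent frames.
        energies[i] = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
    return energies


def kmeans_threshold(energies, n_iter=50):
    """Two-cluster 1-D k-means; returns the midpoint of the two centroids.

    Using the midpoint as the threshold is an assumption of this sketch.
    """
    lo, hi = energies.min(), energies.max()
    for _ in range(n_iter):
        # Assign each frame to the nearer centroid.
        is_high = np.abs(energies - hi) < np.abs(energies - lo)
        if is_high.all() or not is_high.any():
            break  # degenerate case: every frame fell into one cluster
        new_lo = energies[~is_high].mean()
        new_hi = energies[is_high].mean()
        if new_lo == lo and new_hi == hi:
            break  # converged
        lo, hi = new_lo, new_hi
    return 0.5 * (lo + hi)


def median_filter(decisions, width=5):
    """Odd-width sliding median over 0/1 decisions, with edge padding."""
    half = width // 2
    padded = np.pad(decisions, half, mode="edge")
    smoothed = [np.median(padded[i : i + width]) for i in range(len(decisions))]
    return np.array(smoothed).astype(int)


def energy_vad(signal, frame_len=200, hop=80, width=5):
    """Per-frame speech (1) / non-speech (0) decisions for one recording."""
    energies = frame_log_energy(signal, frame_len, hop)
    raw = (energies > kmeans_threshold(energies)).astype(int)
    return median_filter(raw, width)
```

Because the threshold is derived from each recording's own energy distribution, it shifts with the channel gain and noise floor of that recording, which is the motivation the abstract gives for replacing a single fixed minimum energy threshold. The median filter then removes isolated one-frame or two-frame decision flips.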
Keywords/Search Tags: Voice Activity Detection, Energy Clustering, Spectral Subtraction, Self-Adaptive Median Filtering, Deep Neural Network, Supervised Learning Rule