Study On Soft Voice Activity Detection Based On Generalized Gamma Distribution In Transformed Domain

Posted on: 2008-10-13   Degree: Master   Type: Thesis
Country: China   Candidate: S Q Wang   Full Text: PDF
GTID: 2178360212497449   Subject: Communication and Information System
Abstract/Summary:
Voice activity detection (VAD), also called endpoint detection, uses digital signal processing to determine whether a speech signal is present. It is used in many areas of speech signal processing, including speech coding, speech recognition, speech synthesis, and speech communication systems, and the accuracy of its decisions strongly affects the processing that follows. For example, VAD can improve the robustness of a speech recognition system and the quality of synthesized speech. In speech communication systems, VAD enables variable-rate speech coding and increases system capacity. Although VAD can achieve very high accuracy in a quiet environment, changes in the external environment and real-world noise significantly degrade its performance.

With the arrival of the digital information era and the rapid development of mobile communication technology, high-quality speech communication is in demand, and speech signal processing has become an important component of voice communication systems. VAD continues to play an important role in all kinds of speech processing systems, and attention has recently turned to voice communication under low-SNR conditions. VAD cannot be applied in practice until the robustness problem is solved, so research on VAD algorithms in noisy backgrounds is significant.

Recently, various VAD algorithms have been proposed for different environments. They can be divided roughly into two categories according to the decision rule: those based on thresholds and those based on statistical models.
The former mainly include VAD based on short-time energy, short-time zero-crossing rate, short-time autocorrelation, periodicity, spectrum similarity, cepstral coefficients, short-time frequency band variance, entropy, and so on. The latter is typically represented by HMM-based pattern recognition. Each method has its own advantages and disadvantages, and the major ones are summarized below.

Short-time energy-based and zero-crossing-rate-based VAD algorithms are simple in principle and have good real-time performance, and extensive experiments have shown that they perform well in quiet environments. Under low-SNR conditions, however, both perform poorly or fail altogether. Because the autocorrelation-based VAD algorithm is quite sensitive to the type of noise, an improved method based on the autocorrelation similarity distance has been developed. The improved algorithm performs better than the plain autocorrelation-based method, is robust to non-periodic interference, and provides reliable detection at low SNR. The periodicity-based algorithm works reliably in white and impulse noise, but its major drawback is sensitivity to periodic noise. VAD based on spectrum similarity is a simple method that gives good results at both the beginning and the end of speech. Unfortunately, it fails at low SNR, especially when the background noise is non-stationary or mechanical. Cepstrum-based VAD has been shown to give accurate results, particularly at low SNR, but its main disadvantage is high computational complexity. Entropy can be added to extend the feature vector.
Moreover, experiments have shown that VAD algorithms employing entropy work more reliably than purely energy-based methods. However, they fail in babble and musical noise, as well as in periodic background noise.

From the above discussion, we conclude that a VAD algorithm based on a single feature cannot satisfy the practical demands of speech communication and speech recognition systems. Various features have therefore been combined into a feature vector to improve VAD performance. Other VAD algorithms based on the wavelet transform, neural networks, and higher-order statistics have also been proposed. The decision rules have developed from single-threshold, double-threshold, and multi-threshold rules to rules based on fuzzy theory.

In this thesis, we summarize some typical VAD methods and then propose a statistical model-based soft voice activity detection algorithm in a transformed domain. In recent years, speech processing has proved attractive in transformed domains that decorrelate the signal. Among these transformations, the most efficient are the Discrete Cosine Transform (DCT) and the adaptive Karhunen-Loève transform (KLT). The DCT and KLT concentrate the speech energy predominantly into a few coefficients, which in turn improves VAD performance. In this thesis, the speech and noise signals are first decomposed into uncorrelated components by the DCT. The probability distributions of the decorrelated noisy speech and noise signals are both assumed to follow the generalized gamma distribution (GΓD), because recent investigations show that it models speech and noise signals better than the Gaussian, Laplacian, or Gamma pdf. The generalized gamma distribution was proposed as an alternative parametric pdf that models the speech distribution more efficiently.
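
To make the relationship between the GΓD and the Gaussian and Laplacian pdfs concrete, the following is a minimal sketch assuming the two-sided parameterization commonly used in speech modeling, f(x) = γ β^η / (2 Γ(η)) · |x|^(ηγ−1) · exp(−β |x|^γ). The abstract does not state the exact form used in the thesis, so the parameter names `gamma`, `eta`, and `beta` and the function names are illustrative assumptions.

```python
import math


def ggd_pdf(x, gamma, eta, beta):
    """Two-sided generalized gamma density (assumed parameterization):

    f(x) = gamma * beta**eta / (2 * Gamma(eta))
           * |x|**(eta*gamma - 1) * exp(-beta * |x|**gamma)

    Note: when eta*gamma < 1 the density is unbounded at x = 0,
    so x must be nonzero in that case.
    """
    ax = abs(x)
    norm = gamma * beta ** eta / (2.0 * math.gamma(eta))
    return norm * ax ** (eta * gamma - 1.0) * math.exp(-beta * ax ** gamma)


def gaussian_pdf(x, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))


def laplacian_pdf(x, b):
    """Zero-mean Laplacian density with scale b."""
    return math.exp(-abs(x) / b) / (2.0 * b)


# Special cases under this parameterization:
#   gamma = 2, eta = 0.5, beta = 1 / (2 * sigma**2)  -> Gaussian
#   gamma = 1, eta = 1,   beta = 1 / b               -> Laplacian
sigma, b, x = 1.3, 0.8, 0.7
gauss_gap = abs(ggd_pdf(x, 2.0, 0.5, 1.0 / (2.0 * sigma ** 2)) - gaussian_pdf(x, sigma))
laplace_gap = abs(ggd_pdf(x, 1.0, 1.0, 1.0 / b) - laplacian_pdf(x, b))
```

Under this assumed form, both gaps are at floating-point noise level, which is why one family with adjustable shape parameters can fit data that a fixed Gaussian or Laplacian model cannot.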
A computationally inexpensive online algorithm is also proposed to estimate the principal parameters of the GΓD according to the maximum likelihood (ML) principle. To improve the statistical reliability of the noise parameter estimates, we introduce the global speech absence probability (GSAP) as a measure of speech inactivity. In the likelihood ratio test, we use a smoothed likelihood ratio (SLR) in place of the conventional likelihood ratio (LR) to overcome the drawback caused by the delay term in computing the statistic. Experiments show that the SLR-based scheme effectively reduces detection errors in the offset regions of speech. Finally, because signal samples in consecutive frames are strongly correlated, the sequence of frame hypothesis states can be modeled as a first-order Markov process. A Hidden Markov Model (HMM) with two states, representing silence and speech, is employed to estimate the probability of Voice Being Active (VBA) recursively. To achieve a considerable improvement in VAD performance, this thesis adopts an HMM-based soft decision rule that prevents clipping of weak speech. Experimental and objective test results show that the proposed GΓD-based soft VAD algorithm outperforms algorithms based on other statistical models in the DFT domain. In addition, the proposed VAD outperforms the G.729 Annex B VAD and the traditional Sohn VAD in many kinds of noise environments.
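
The smoothed likelihood ratio and the two-state HMM recursion described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the abstract does not define the SLR, so the sketch uses first-order recursive smoothing of the per-frame log likelihood ratio, and the smoothing factor `kappa`, the transition probabilities `a01` and `a11`, and the function names are all hypothetical.

```python
import math


def smooth_log_lr(log_lrs, kappa=0.9):
    """Recursively smooth per-frame log likelihood ratios.

    Assumed form: s[t] = kappa * s[t-1] + (1 - kappa) * llr[t],
    starting from s = 0 (neutral evidence before the first frame).
    """
    smoothed = []
    prev = 0.0
    for llr in log_lrs:
        prev = kappa * prev + (1.0 - kappa) * llr
        smoothed.append(prev)
    return smoothed


def voice_active_probability(log_lrs, a01=0.1, a11=0.9, p_init=0.5):
    """Forward recursion of a two-state (silence/speech) HMM.

    a01: assumed P(speech at t | silence at t-1)
    a11: assumed P(speech at t | speech at t-1)
    Returns P(speech at t | frames up to t) for every frame.
    """
    probs = []
    p = p_init
    for llr in log_lrs:
        # Predict the speech prior from the previous posterior.
        pred = a01 * (1.0 - p) + a11 * p
        # Clamp the log ratio so exp() cannot overflow.
        lr = math.exp(max(-50.0, min(50.0, llr)))
        # Bayes update using lr = p(frame | speech) / p(frame | silence).
        p = pred * lr / (pred * lr + (1.0 - pred))
        probs.append(p)
    return probs


# Hypothetical per-frame log likelihood ratios: silence, speech, silence.
frame_llrs = [-2.0, -2.0, 3.0, 3.0, 3.0, -2.0, -2.0]
vba = voice_active_probability(smooth_log_lr(frame_llrs))
```

Because the output is a probability and not a hard 0/1 label, a downstream coder or recognizer can weight weak speech frames instead of clipping them, which is the motivation the abstract gives for the soft decision rule.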
Keywords/Search Tags: voice activity detection, generalized gamma distribution, Hidden Markov Model, smoothed likelihood ratio, discrete cosine transform