Study On Voice Activity Detection Algorithm Based On General Gaussian Model

Posted on: 2006-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: N Zhang
Full Text: PDF
GTID: 2168360155953142
Subject: Signal and Information Processing
Abstract/Summary:
The system that separates conversational speech from silence is called a Voice Activity Detector (VAD). Such a system is required in many speech processing applications, such as speech coding, speech enhancement, telephony, wireless communications and speech recognition. Various types of VAD algorithms have been proposed recently, based on energy, zero-crossing rates, periodicity, cepstral features, and decision-making based on a combination of different parameters. The VAD decision is made according to a decision rule, which can be based either on a decision threshold or on statistical models.

1. Decorrelation of Speech Signal

Since the correlation between successive samples of a speech signal is commonly rather high, a speech data vector can be represented, to a small degree of error, by a small number of components in the Karhunen-Loève Transform (KLT) domain. The KLT is an orthonormal transform. One drawback of the KLT is that it is data dependent and varies slowly over time with speech signals. Moreover, the KLT is computationally expensive and must also be estimated adaptively; therefore, simple orthogonal transformations such as the Discrete Cosine Transform (DCT), Discrete Fourier Transform (DFT), and Short-time Fast Fourier Transform (SFFT) are often used instead in many applications as suboptimal alternatives. Of all discrete orthogonal transforms, the DCT is the most popular, not only for its nearly optimal performance in whitening lowpass signals, but also for its computational efficiency. We considered the DCT as a computationally inexpensive transform that converts speech signals into reasonably uncorrelated components. Several experiments show that the performance of the DCT is comparable with that of the KLT. One main reason for this is the estimation errors of the KLT.

2. Statistical Distribution Model of Speech and Noise Signals

In discussing the waveform representations of speech signals, it is often assumed that the speech waveforms can be represented by an ergodic random process.
Based on this simplification, the statistical model of speech has been studied. In the early 1950s, Davenport investigated the probability density using histograms of speech samples over long time periods of around 3 minutes. The results showed that a gamma distribution is a good approximation of speech amplitude densities, and that a Gaussian or Laplacian distribution is a poorer but simpler approximation. A Gaussian statistical model of speech and noise signals in the DFT domain, which is based on the Central Limit Theorem, has been used in VAD algorithms. The Gaussian model rests on the fact that each Fourier expansion coefficient is a weighted sum of random variables resulting from the transform process. As the frame length increases, the coefficients tend to a Gaussian distribution. The Gaussian model has also been used in the DCT domain. Several tests have shown that the Gaussian model is a good and simple approximation of noise amplitude densities.

Related work has been done in the area of image processing, where the distribution of the DCT coefficients of images has been estimated. Reininger and Gibson used the Kolmogorov-Smirnov (KS) test to verify that the DCT coefficients of images have a Laplacian distribution. The Laplacian model has been used in VAD algorithms because it is a simpler approximation of speech signal amplitude densities, and several VAD algorithms based on the Laplacian model have been proposed and shown to be effective.

In this paper, we propose a VAD algorithm based on the Generalized Gaussian Model in the DCT domain. Some tests have shown that the Generalized Gaussian Model is a better approximation of speech signal amplitude densities. Muller showed that the Generalized Gaussian Distribution (GGD) would give a significantly lower value of the test statistic.
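
For reference, the generalized Gaussian density is commonly written with a scale parameter alpha and a shape parameter beta. The sketch below uses that textbook form, which is an assumption here because the abstract does not state the parameterization used in the thesis. With beta = 2 it reduces to a Gaussian and with beta = 1 to a Laplacian, so the model covers both of the simpler approximations discussed above. All function names are illustrative.

```python
import math


def ggd_pdf(x, alpha, beta):
    """Zero-mean generalized Gaussian density (textbook form, assumed here).

    f(x) = beta / (2 * alpha * Gamma(1 / beta)) * exp(-(|x| / alpha) ** beta)
    alpha > 0 is the scale, beta > 0 is the shape.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(x) / alpha) ** beta))


def gaussian_pdf(x, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))


def laplacian_pdf(x, b):
    """Zero-mean Laplacian density with scale b."""
    return math.exp(-abs(x) / b) / (2.0 * b)


if __name__ == "__main__":
    for x in (0.0, 0.5, 2.0):
        # beta = 2 with alpha = sigma * sqrt(2) matches the Gaussian;
        # beta = 1 with alpha = b matches the Laplacian.
        print(x,
              ggd_pdf(x, math.sqrt(2.0), 2.0), gaussian_pdf(x, 1.0),
              ggd_pdf(x, 1.0, 1.0), laplacian_pdf(x, 1.0))
```

Shape values below 1 give a sharper peak and heavier tails than the Laplacian, which is the extra freedom a generalized Gaussian fit has over the two fixed-shape models.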
Some tests have shown that full-frame DCT coefficients can be modeled by Generalized Gaussian distributions, which perform better than Laplacian distributions.

3. Parameter Estimation of Speech and Noise Signals

The speech signal is nonstationary, in the sense that the statistical parameters characterizing the speech distributions vary slowly with time. This assumption allows us to update the distributions recursively. The estimation of speech and background noise statistics is possible provided that the speech statistics are stationary over 20 ms time frames and the noise statistics are stationary over a longer time period. The noise statistics might be updated with the samples previously detected within the active frames. We use the maximum likelihood (ML)...
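
The recursive updating mentioned above can be illustrated with a simple exponentially weighted variance tracker. This is a minimal sketch under stated assumptions, not the estimator used in the thesis: the forgetting factor, the function names and the per-frame mean-square statistic are all hypothetical choices. The mean square is the ML variance estimate only under a zero-mean Gaussian assumption. Which frames feed the update is left to the caller through a list of flags.

```python
def frame_mean_square(frame):
    """Mean of squared coefficients in one frame (assumed zero-mean)."""
    return sum(c * c for c in frame) / len(frame)


def update_variance(prev_var, frame, forgetting=0.95):
    """One recursive step: blend the old estimate with the new frame statistic.

    `forgetting` is a hypothetical smoothing constant in (0, 1); values
    closer to 1 make the estimate change more slowly.
    """
    return forgetting * prev_var + (1.0 - forgetting) * frame_mean_square(frame)


def track_variance(frames, use_frame, init_var, forgetting=0.95):
    """Track a variance estimate over a frame sequence.

    `use_frame[i]` says whether frame i contributes to the update;
    frames that do not contribute leave the estimate unchanged.
    """
    var = init_var
    history = []
    for frame, use in zip(frames, use_frame):
        if use:
            var = update_variance(var, frame, forgetting)
        history.append(var)
    return history


if __name__ == "__main__":
    frames = [[1.0, -1.0], [3.0, 3.0], [2.0, 2.0]]
    print(track_variance(frames, [True, False, True], init_var=1.0, forgetting=0.5))
```

A slowly varying statistic is what makes this kind of update reasonable: each new frame only nudges the running estimate instead of replacing it.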
Keywords/Search Tags: Voice Activity Detection, General Gaussian Model, Soft Decision, Hidden Markov Model