Whispered Activity Detection Based On Modified Group Delay Function Numerator

Posted on: 2017-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Shang
GTID: 2308330482989754
Subject: Signal and Information Processing
Abstract/Summary:
Whispered speech, as a supplement to normal speech, can convey sensitive information in public places under certain circumstances. Because of these unique advantages, whisper analysis has drawn increasing attention from scholars in China and abroad. Whispered activity detection (WAD) is the first key step toward exploiting these advantages further. In practical applications, whispered speech usually occurs at a low signal-to-noise ratio (SNR). Compared with normal speech, the distinct way whispers are produced results in the absence of a fundamental frequency. The vocal tract configuration also changes, shifting the formants toward higher frequencies, which is especially noticeable for the low-frequency formants. Conventional voice activity detection methods are therefore poorly suited to whispered activity detection.
To address this issue, the thesis proposes a fusion of two complementary features, the subband modulation spectrum (SMS) feature and the subband correntropy (SCE) feature, both extracted from the Hilbert envelope of the numerator group delay (HNGD) spectrum. The group-delay-based instantaneous spectrum is computed using zero-time windowing and the group delay function, which gives better time and frequency resolution than the traditional discrete Fourier transform (DFT) spectrum. The SMS features capture the spectral representation of the subband energy time trajectories, covering both the short-term and the long-term spectral characteristics of whispered speech, and so provide good separation between speech and noise components. To reduce the redundant information in the resulting supervector, principal component analysis (PCA) is applied over the dataset to reduce its dimension. The SCE features model the fluctuations in the subband energy time trajectories, which helps capture the dynamics of the vocal tract system and discriminate noisy whispers from noise.
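As a rough illustration of how an HNGD spectrum can be obtained for one segment, the NumPy sketch below applies a zero-time window, computes the numerator of the group delay, and takes the Hilbert envelope. The window definitions w1 and w2, the DFT length of 1024, the double differencing of the numerator group delay before the envelope step, and the function names are assumptions drawn from the published zero-time windowing literature, not details confirmed by this thesis. In that literature, sliding the segment start one sample at a time yields a spectrum at every instant, which is the source of the improved temporal resolution.

```python
import numpy as np


def hilbert_envelope(sig):
    """Magnitude of the analytic signal of a real sequence, built with the FFT."""
    n = len(sig)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(sig) * h))


def hngd_spectrum(segment, n_fft=1024):
    """Sketch of an HNGD spectrum for a segment starting at the analysis instant.

    Assumed recipe: zero-time window -> numerator group delay (NGD)
    -> difference twice -> Hilbert envelope.
    """
    m = len(segment)
    if m > n_fft:
        raise ValueError("segment must not be longer than n_fft")
    n = np.arange(m)
    # Zero-time window (assumed form): w1[0] = 0, w1[n] = 1 / (4 sin^2(pi n / (2 N)))
    w1 = np.zeros(m)
    w1[1:] = 1.0 / (4.0 * np.sin(np.pi * n[1:] / (2.0 * n_fft)) ** 2)
    # Taper (assumed form) to reduce truncation effects at the end of the segment
    w2 = 4.0 * np.cos(np.pi * n / (2.0 * m)) ** 2
    x = segment * w1 * w2
    X = np.fft.rfft(x, n_fft)          # DFT of x[n]
    Y = np.fft.rfft(n * x, n_fft)      # DFT of n * x[n]
    # Numerator of the group delay tau = (X_R Y_R + X_I Y_I) / |X|^2
    ngd = X.real * Y.real + X.imag * Y.imag
    # Double differencing to sharpen peaks smeared by the w1 window (assumption)
    dd = np.diff(ngd, n=2)
    return hilbert_envelope(dd)


rng = np.random.default_rng(0)
segment = rng.standard_normal(80)   # e.g. 5 ms at an assumed 16 kHz sampling rate
env = hngd_spectrum(segment)        # 511 envelope values over frequency
```

Dividing by |X|^2 is skipped on purpose: using only the numerator avoids the spikes that the full group delay shows near spectral zeros.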
WAD based on a single feature performs well at high SNR, but noise interference causes a sharp decline in detection performance. Fusing the two features exploits their complementary nature to compensate for the limitations of any single feature and thereby improve performance. The proposed features are evaluated with a support vector machine (SVM) using whispered speech material from the CHAINS (CHAracterizing INdividual Speakers) speech corpus and five typical noise sources from the NOISEX-92 database. To avoid both over-fitting and under-fitting, the classifier parameters are optimized on the training data with a cross-validation strategy. WAD performance is also investigated for different window sizes. The proposed fusion features are compared with several standard features: AMS-ST (Amplitude Modulation Spectrogram with Modulation Spectral Tilt values), MFCCs (Mel-frequency cepstral coefficients with delta and acceleration coefficients), LTLEV (long-term logarithmic energy variation), and RASTA-PLP (relative spectral perceptual linear prediction coefficients). The experimental results show that the modulation spectrum features outperform the MFCC, RASTA-PLP, and LTLEV features across all noise types at all SNRs. The proposed HSMS-CE features perform significantly better than the AMS-ST features for babble and factory noise, especially at 0 dB SNR, whereas for white and pink noise at 0 dB the AMS-ST features perform better. In the remaining cases, both features show similar performance.
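The feature fusion can be illustrated with a minimal NumPy sketch, assuming the subband energy trajectories are obtained by summing per-frame spectra (for example, HNGD spectra) over equal-width subbands. The band count, number of modulation bins, correntropy lags, log compression, Gaussian kernel width (set to each trajectory's standard deviation), PCA dimension, and all function names are hypothetical choices for illustration; the thesis's exact SMS and SCE definitions are not reproduced here. Fusion is shown as simple concatenation of the PCA-reduced SMS supervector with the SCE vector.

```python
import numpy as np


def subband_energies(spectra, n_bands=8):
    """Sum per-frame spectra (frames x bins) into equal-width subbands -> frames x bands."""
    bands = np.array_split(spectra, n_bands, axis=1)
    return np.stack([b.sum(axis=1) for b in bands], axis=1)


def sms_feature(traj, n_mod=4):
    """Hypothetical SMS: |DFT| of each mean-removed subband trajectory, lowest non-DC bins."""
    centered = traj - traj.mean(axis=0)
    mod = np.abs(np.fft.rfft(centered, axis=0))
    return mod[1:n_mod + 1].T.ravel()   # one supervector of bands * n_mod values


def correntropy(sig, lags, sigma):
    """Sample correntropy V(m) = mean_n k_sigma(sig[n] - sig[n - m]), Gaussian kernel."""
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    return np.array([
        norm * np.mean(np.exp(-(sig[m:] - sig[:len(sig) - m]) ** 2 / (2.0 * sigma ** 2)))
        for m in lags
    ])


def sce_feature(traj, lags=(1, 2, 3)):
    """Hypothetical SCE: correntropy of each log subband trajectory at a few lags."""
    feats = []
    for k in range(traj.shape[1]):
        s = np.log(traj[:, k] + 1e-12)
        sigma = s.std() + 1e-12          # kernel width: an assumed heuristic
        feats.append(correntropy(s, lags, sigma))
    return np.concatenate(feats)


def pca_reduce(X, n_comp):
    """Project mean-centred rows of X onto the top n_comp principal axes (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_comp].T


rng = np.random.default_rng(0)
# 20 analysis windows, each 32 frames x 256 bins of synthetic magnitude spectra
windows = [np.abs(rng.standard_normal((32, 256))) for _ in range(20)]
trajs = [subband_energies(w) for w in windows]          # each 32 x 8
sms = np.stack([sms_feature(t) for t in trajs])         # 20 x 32 supervectors
sce = np.stack([sce_feature(t) for t in trajs])         # 20 x 24
fused = np.hstack([pca_reduce(sms, 10), sce])           # 20 x 34 fused features
```

The fused vectors would then be passed to an SVM whose parameters are tuned by cross-validation on the training data, as the abstract describes; that step is omitted here to keep the sketch dependency-free.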
Keywords/Search Tags: Whispered Activity Detection, Instantaneous Spectrum Analysis, Zero-time Windowing, Group Delay, Modulation Spectrum, Correntropy