
Whisper To Speech Conversion And Whisper Recognition Modeling Method

Posted on: 2016-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: J J Li
Full Text: PDF
GTID: 2308330470457760
Subject: Communication and Information Engineering
Abstract/Summary:
Whispering is a form of human vocal communication widely used in daily life. When we whisper, our vocal cords do not vibrate as they normally would, even when producing vowels and other voiced phonemes. This reduces the energy of whispered speech and makes whispering a safe, even intimate, way to communicate in public places without disturbing others or leaking private information. However, it also makes whispers harder to understand, especially in the presence of background noise.

This research focuses on two aspects of whisper interaction: converting whispers into natural-sounding speech, namely whisper-to-speech, and recognizing whispers automatically, namely whisper recognition.

In this thesis, we propose three whisper-to-speech methods. The first is a parametric conversion framework based on the sinewave speech analysis and re-synthesis model. It requires no training data to tune conversion-model parameters. Instead, several formants are extracted from the whisper, and both the magnitude and centre frequency of each formant are obtained through Linear Prediction Coding (LPC) analysis. Sinewave speech is synthesized from this enhanced formant information, then mixed with the original whisper and with pitch information (also estimated from the formants) to produce spectrally enhanced speech. The method is fast, has low computational complexity, and achieves a better mean opinion score (MOS) than previous code-excited linear prediction (CELP) based reconstruction models.

In addition, we model the whisper-to-speech conversion problem in a statistical paradigm. An existing Gaussian mixture model (GMM) approach can reconstruct normal speech from whispers more naturally than the non-training method.
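As a rough illustration of the sinewave re-synthesis step, the sketch below renders each formant track as one amplitude-modulated sinusoid, with phase accumulated across frames to avoid discontinuities. It is a minimal sketch under assumed inputs (per-frame formant frequencies and magnitudes, e.g. from LPC analysis of the whisper); the thesis method additionally mixes in the original whisper and estimated pitch, which is omitted here.

```python
import numpy as np

def sinewave_speech(formant_freqs, formant_mags, sr=16000, hop=160):
    """Synthesize sinewave speech from per-frame formant tracks.

    formant_freqs, formant_mags: arrays of shape (n_frames, n_formants),
    hypothetically obtained from LPC analysis of a whisper.
    Each formant becomes one sinusoid; hop=160 gives 10 ms frames at 16 kHz.
    """
    n_frames, n_formants = formant_freqs.shape
    out = np.zeros(n_frames * hop)
    phase = np.zeros(n_formants)           # carried across frames
    n = np.arange(hop)
    for t in range(n_frames):
        # per-sample phase increment for each formant in this frame
        dphi = 2 * np.pi * formant_freqs[t] / sr
        for k in range(n_formants):
            out[t*hop:(t+1)*hop] += formant_mags[t, k] * np.sin(phase[k] + dphi[k] * n)
            phase[k] = (phase[k] + dphi[k] * hop) % (2 * np.pi)
    return out

# toy usage: three flat formant tracks over 10 frames
freqs = np.full((10, 3), [500.0, 1500.0, 2500.0])
mags = np.ones((10, 3))
y = sinewave_speech(freqs, mags)
```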
However, the GMM has its own limitations: it can only model compressed Mel-cepstral coefficients, and, with the diagonal covariance setting forced by limited training data, it fails to model inter-dimensional correlation. As a result, speech converted from whispers with a GMM usually sounds slightly 'muffled'. To address these problems, we propose a mixed restricted Boltzmann machine (RBM) based whisper-to-speech model. It can model high-dimensional spectral envelope information and also capture inter-dimensional correlation through the weight connections between the visible and hidden layers. On the wTIMIT English whisper reconstruction task, the resulting model strongly outperforms the GMM method in subjective evaluation.

Furthermore, we build a deep whisper-to-speech model using deep neural networks (DNNs). When a whisper-to-speech spectral-envelope DNN regression model is trained with a minimum mean squared error objective, following standard RBM-based pre-training and error back-propagation (BP) based fine-tuning, we found that the DNN is prone to over-fitting when the parallel data for each speaker is limited. To solve this problem, a semi-supervised DNN training flow is proposed in this thesis. The input and output layers, together with their neighbouring hidden layers, are interpreted as two separate RBMs, which are trained on whisper and normal-speech spectral envelopes respectively. The remaining middle layers of the DNN are then trained on the binary hidden-layer data to build a mapping between whisper and parallel normal-speech spectral features. The DNN is finally used as an integral model to convert whisper spectral envelopes into normal-speech envelopes. The speech output by this semi-supervised DNN is slightly better than that produced by the RBM-based model.

Last but not least, we have undertaken research on whisper recognition using the popular DNN-HMM hybrid paradigm.
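The RBM building block used both in the mixed-RBM conversion model and in DNN pre-training can be sketched as a minimal Bernoulli RBM trained with one-step contrastive divergence (CD-1). This is a toy numpy sketch, not the thesis implementation: real spectral envelopes would call for a Gaussian-Bernoulli visible layer, and all dimensions and hyper-parameters below are arbitrary.

```python
import numpy as np

class RBM:
    """Minimal Bernoulli RBM trained with CD-1 (illustrative only)."""
    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible bias
        self.c = np.zeros(n_hidden)    # hidden bias
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(self, v):
        return self._sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        return self._sigmoid(h @ self.W.T + self.b)

    def cd1_step(self, v0):
        """One CD-1 update on a batch; returns reconstruction error."""
        h0 = self.hidden_probs(v0)
        h_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h_sample)      # mean-field reconstruction
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        # positive phase minus negative phase, averaged over the batch
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b += self.lr * (v0 - v1).mean(axis=0)
        self.c += self.lr * (h0 - h1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))

# toy usage: binary data with correlated columns, so there is structure to learn
rng = np.random.default_rng(1)
data = (rng.random((64, 8)) < 0.5).astype(float)
data[:, 4:] = data[:, :4]
rbm = RBM(8, 6)
errs = [rbm.cd1_step(data) for _ in range(200)]
```

In the semi-supervised flow described above, one such RBM would be trained on whisper envelopes and another on normal-speech envelopes, with the middle DNN layers learning the mapping between their hidden codes.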
Whisper recognition can extend the reach of large-vocabulary continuous speech recognition (LVCSR) by letting people whisper to their intelligent devices, for example whispering to an Apple Watch in a public place to send a message without being overheard. However, whisper LVCSR is much more challenging than normal LVCSR because of the low signal-to-noise ratio (SNR) and the flatness of whisper spectra. Furthermore, far less whisper data is available for training than normal speech. In this thesis, a knowledge-transfer (KT) DNN acoustic model is deployed to address these problems. Model adaptation is then performed on the KT-DNN, based on discriminative speaker-identity information, to normalize speaker and environmental variability in whispers. On a Mandarin whisper dictation task with 55 hours of whisper data, the proposed speaker-independent (SI) KT-DNN model achieves a 56.7% character error rate (CER) improvement over a baseline Gaussian mixture model (GMM) discriminatively trained on the whisper data alone. Using this approach, the CER of the proposed model reaches 15.2% when tested on whispers, which is close to the performance of a state-of-the-art DNN trained with one thousand hours of normal speech data. From this baseline, the model-adapted DNN gains a further 10.9% CER reduction over the generic model.
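The relative CER figures quoted above (e.g. the 56.7% improvement over the GMM baseline and the further 10.9% reduction from adaptation) follow the usual definition of relative error-rate reduction. A trivial helper makes the arithmetic explicit; the numbers in the example are illustrative, not the thesis results.

```python
def relative_cer_reduction(baseline_cer, new_cer):
    """Relative character-error-rate reduction of new_cer over baseline_cer,
    e.g. 20.0% -> 10.0% is a 0.5 (50%) relative reduction."""
    return (baseline_cer - new_cer) / baseline_cer

# illustrative usage (hypothetical CER values, in percent)
r = relative_cer_reduction(20.0, 10.0)
```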
Keywords/Search Tags: whisper, whisper-to-speech, whisper recognition, sinewave speech, GMM, RBM, DNN, knowledge transfer, speaker adaptation, speaker code