
Research On Robust Speech Recognition

Posted on: 2008-11-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Dong
Full Text: PDF
GTID: 1118360212997985
Subject: Communication and Information System
Abstract/Summary:
Speech is the most convenient and effective means of human communication. With the rapid development and wide application of computer technology, people increasingly hope to achieve natural man-machine communication by speech. Automatic speech recognition (ASR) arose to meet this need and has made remarkable progress in recent years. It is now moving from laboratory research into real-world applications and may become the leading user interface for future operating systems and application programs.

Most speech recognition systems are designed for clean speech and can accomplish fairly complex recognition tasks with high accuracy in controlled, quiet laboratory environments. However, when an ASR system is used in a real-life situation, background noise inevitably causes a mismatch between training and testing conditions. System performance then deteriorates severely, which is the major obstacle to the commercial use of speech recognition technology. Increasing the robustness of ASR is therefore both significant and necessary.

The aim of robust speech recognition is to alleviate the effect of this mismatch and to achieve good recognition performance in noisy conditions. The methods studied in this area can be broadly classified into three categories: speech enhancement in signal space, robust feature extraction in feature space, and speech model compensation in model space. This thesis focuses on the first two, i.e. improving speech recognition accuracy in signal space and feature space under additive background noise using several new approaches. The main contributions are as follows.

1. Speech enhancement aims at extracting clean speech from a noisy signal while suppressing noise, minimizing speech distortion and enhancing speech intelligibility.
For robust speech recognition, speech enhancement usually serves as a preprocessor that delivers an almost clean speech signal to the ASR system, so the recognition system itself does not need to be changed to make it robust. Currently, most enhancement algorithms have important limitations, as they focus on only one given type of noise, and as noise becomes more diverse the techniques grow more and more complex. Moreover, many algorithms are designed with intelligibility in mind, so the enhanced speech signal may lose some useful information, which can degrade the performance of the ASR system. To cope with these problems, this thesis proposes a Kalman-filter speech enhancement algorithm based on higher-order cumulants.

The performance of the Kalman-filter algorithm depends mostly on the precision of the clean-speech LPC parameters and the impulse gain. Because higher-order cumulants are robust to Gaussian noise, the LPC parameters of the clean signal can be estimated by solving the modified Yule-Walker (MYW) equation built from the third-order cumulant of the noisy signal. The required impulse gain is obtained approximately from the estimated model parameters and the noise variance.

Enhancement performance is evaluated with three objective measures (power spectrogram, time-domain waveform and SNR) under nine types of noise at different SNR conditions. Simulation results show that the algorithm is simple, effective and robust in the presence of very complicated noise. There are significant improvements in both SNR and perceptual quality, and the distortion of the enhanced speech is very small. The algorithm is therefore well suited to preprocessing for robust speech recognition.
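The role of the third-order cumulant in the LPC estimate can be made concrete with the standard textbook relations below. This is only a sketch: the abstract does not give the thesis's exact equations, and the AR order p, the coefficients a_i, the gain g, the excitation w(n) and the noise v(n) are notation assumed here for illustration. The relations also assume a zero-mean, stationary signal whose excitation has non-zero skewness.

```latex
% Assumed AR(p) model of clean speech s(n), observed as y(n) in additive noise v(n)
s(n) = -\sum_{i=1}^{p} a_i\, s(n-i) + g\, w(n), \qquad y(n) = s(n) + v(n)

% Third-order cumulant of a zero-mean stationary signal
c_{3y}(\tau, k) = E\{\, y(n)\, y(n+\tau)\, y(n+k) \,\}

% If v(n) is Gaussian and independent of s(n), its third-order cumulant is zero, so
c_{3y}(\tau, k) = c_{3s}(\tau, k)

% Yule-Walker-type relation in the cumulant domain, valid for \tau > \max(0, k)
c_{3y}(\tau, k) + \sum_{i=1}^{p} a_i\, c_{3y}(\tau - i, k) = 0
```

Stacking the last relation over enough lags gives a linear system in the a_i whose entries are computed from the noisy signal alone, which is why Gaussian noise does not bias the estimate in this idealized setting.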
In an isolated-word speech recognition system, experiments show that cascading this enhancement stage with the recognizer improves recognition accuracy at low SNR levels.

2. We propose an adaptive recursive algorithm, based on the conjugate gradient method, for estimating AR model parameters when solving the third-order cumulant MYW equation. Compared with the estimation errors obtained on a noisy AR sequence using RIV, direct inversion and LMS, this algorithm converges fastest and is the most accurate, without requiring a large number of matrix inversions. In addition, when the power spectra of a noisy sine sequence and of a speech signal are reconstructed by parametric spectral estimation, the model parameters estimated by conjugate gradient perform well in envelope fitting, formant sharpness and resolution, even at very low SNR.

3. Pitch detection under noisy conditions is one of the most difficult problems in speech signal processing. Exploiting the way signal discontinuities propagate across the different resolutions of the wavelet transform, this thesis presents a new pitch detection method based on the wavelet transform and the circular AMDF (WCAMDF). The method overcomes the low accuracy, high complexity and lack of robustness of many existing pitch detection algorithms. Simulation results indicate that the proposed algorithm offers better pitch detection precision for speech under strong background noise, low computational complexity, high resolution and suitability for real-time implementation.

4. The wavelet transform adapts to the signal. This thesis studies multi-threshold estimation of regular signals based on the wavelet transform and its application to speech enhancement, in which the noisy speech signal is denoised in the wavelet domain. Theoretical analysis and experiments indicate that the SURE translate soft threshold is the best suited to speech signals and gives excellent enhancement performance.
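As a rough illustration of wavelet-domain soft thresholding in general, the sketch below denoises a signal with a one-level Haar transform. It is not the thesis's method: the Haar wavelet, the single decomposition level, the fixed threshold `t` and all function names are assumptions made for this example, and the SURE-based threshold selection studied in the thesis is not reproduced.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """One-level Haar transform of an even-length sequence.

    Returns (approximation, detail) coefficient lists.
    """
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x) - 1, 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the signal from both coefficient lists."""
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) / SQRT2)
        x.append((a - d) / SQRT2)
    return x


def soft_threshold(coeffs, t):
    """Shrink every coefficient toward zero by t; zero those with |c| <= t."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


def denoise(x, t):
    """Soft-threshold only the detail coefficients, then reconstruct."""
    approx, detail = haar_dwt(x)
    return haar_idwt(approx, soft_threshold(detail, t))
```

In this sketch only the detail coefficients are shrunk, on the usual assumption that noise is spread across the fine-scale coefficients while the approximation carries most of the signal. A practical system would use several decomposition levels and a data-driven threshold per level.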
Evaluations based on the power spectrogram, time-domain waveform and SNR show that this method is effective in noisy conditions.

5. Voice activity detection (VAD) plays a very important role in ASR systems. Correct endpoint detection reduces computational cost and shortens run time, whereas incorrect detection of the beginning and ending boundaries of the test utterance is a major cause of recognition errors. Every recognition system therefore needs reliable, accurate, real-time, adaptive and robust VAD. Based on the wavelet transform, this thesis proposes two novel strategies for accurate and robust endpoint detection in noisy environments.

1) Endpoint detection based on WCAMDF pitch extraction. WCAMDF extracts accurate pitch information despite variations in the noise environment. Using the magnitude envelope of the CAMDF computed during pitch extraction, the proposed algorithm is shown to improve robustness in both detection accuracy and recognition performance at low SNR levels, with an average recognition error rate reduction of more than 21%.

2) Endpoint detection based on wavelet energy-entropy. Detection using basic energy and spectral entropy becomes difficult and inaccurate when speech signals are contaminated by colored noise, whereas a key property of the wavelet transform is that the residual noise in the enhanced speech signal is almost white. We therefore couple the two closely: instead of computing the energy-entropy feature from the original noisy signal, the feature is computed after the wavelet transform. This modification outperforms basic energy-entropy and improves the discriminability between speech and noise, which makes the threshold easier to set.

The two endpoint detection approaches can run alongside pitch extraction or speech enhancement.
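The two ingredients of a basic energy-entropy feature, frame energy and spectral entropy, can be sketched as follows. This shows only the plain per-frame quantities, computed with a naive DFT so the example has no dependencies. The function names are hypothetical, and the abstract does not say how the thesis fuses the two values or which wavelet stage precedes them, so neither is shown.

```python
import cmath
import math


def frame_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(v * v for v in frame)


def spectral_entropy(frame):
    """Shannon entropy (in nats) of the normalized power spectrum.

    Uses DFT bins 0..N/2 of a real-valued frame. Low values mean the power
    is concentrated in a few bins; high values mean a flat spectrum.
    """
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n)
                  for m in range(n))
        power.append(abs(acc) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0  # convention chosen here for an all-zero frame
    probs = [p / total for p in power]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def energy_and_entropy(signal, frame_len, hop):
    """Per-frame (energy, entropy) pairs over a sliding window."""
    return [(frame_energy(signal[i:i + frame_len]),
             spectral_entropy(signal[i:i + frame_len]))
            for i in range(0, len(signal) - frame_len + 1, hop)]
```

A tonal or voiced frame concentrates power in a few bins and so has low entropy, while a noise-like frame has a flatter spectrum and higher entropy. An endpoint detector would compare some combination of the two values against a threshold.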
Both endpoint detectors are real-time, simple and easy to implement, and have low model complexity, which is especially important in large-vocabulary ASR systems where the available processing power and memory are limited.

6. In real-world ASR applications, robust feature extraction is one of the most crucial issues. It aims at finding succinct, salient and representative characteristics of noisy speech utterances that discriminate between them. Robust features are needed to offer acceptable recognition performance in various noisy environments. Mel-frequency cepstral coefficients (MFCC) are well accepted as a good choice of speech features with reasonable robustness, and many advanced techniques have been built on them. This thesis proposes three improved methods based on MFCC.

Teager energy-entropy MFCC (TEMFCC). Teager energy-entropy features are commonly used for locating the endpoints of an utterance. When integrated with MFCC, they are shown to increase average accuracy by 10% compared with MFCC in the baseline system.

LDA-TEMFCC. Adding Teager energy-entropy increases the dimension of the feature vectors. To overcome this shortcoming, Linear Discriminant Analysis (LDA) is used for classification and dimensionality reduction of the feature vectors. The 20-dimensional LDA-TEMFCC robust features yield a 6% increase in recognition performance compared with the 24-dimensional MFCC baseline.

HOC-LPC-MFCC. MFCC derived directly from the power spectrum of noisy speech are excessively sensitive to external additive colored noise and generally degrade recognition performance in noisy conditions. Exploiting the strong Gaussian-noise suppression property of higher-order cumulants (HOC), HOC-LPC-MFCC feature vectors are developed: the speech power spectrum is reconstructed from the model parameters estimated from the third-order cumulant of the noisy signal, and MFCC are derived from the reconstruction.
Experimental results show that the proposed features achieve significant noise robustness in all conditions compared with plain MFCC.
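The Teager energy underlying the TEMFCC features above is commonly computed with the discrete Teager-Kaiser operator, sketched below. The function name is hypothetical, and the abstract does not state how the thesis combines Teager energy with entropy or appends it to the MFCC vector, so only the operator itself is shown.

```python
import math


def teager_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returned for interior samples n = 1 .. len(x) - 2, so the output is
    two samples shorter than the input.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# For a pure sinusoid A*cos(w*n) the operator is constant and equals
# A^2 * sin(w)^2, so it tracks both amplitude and frequency.
amplitude, omega = 2.0, 0.3
sine = [amplitude * math.cos(omega * n) for n in range(50)]
psi = teager_energy(sine)
```

Because the operator needs only three adjacent samples per output value, it is cheap enough to compute frame by frame alongside the MFCC front end.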
Keywords/Search Tags: robust speech recognition, speech enhancement, conjugate gradient, wavelet transform, pitch extraction, endpoint detection, feature extraction