Bimodal Speech Recognition Technology Research Based On Audio And Video

Posted on: 2015-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: N Zhou
GTID: 2268330428472596
Subject: Signal and Information Processing
Abstract/Summary:
As computers become more intelligent, human-computer interaction faces higher demands, and traditional speech recognition cannot meet the requirements of everyday applications. To address the poor robustness of traditional speech recognition in strongly noisy environments, this thesis proposes and designs a bimodal speech recognition system based on audio and video. Exploiting the bimodal nature of human speech perception, it integrates visual lip-motion information into the speech recognition system and obtains good results.

The research is divided into three parts: audio signal processing, video signal processing, and the fusion and recognition of audio and video. In the audio part, the traditional dual-threshold speech endpoint detection method based on short-time energy and short-time zero-crossing rate is improved so that it performs better in strong noise. Mel-frequency cepstral coefficients (MFCC) are then extracted as the audio feature parameters. In the video part, video frames are extracted dynamically and saved as images, and the mouth region is roughly located using the face detection technique of OpenCV. The lip region is then separated from the skin region within the mouth-region image according to the difference between lip and skin in the Lab color space. Lip features, extracted by a hybrid feature extraction algorithm that combines a geometric model with pixel information, serve as the video feature parameters. In the fusion and recognition part, a weight-based audio-video decision fusion algorithm is used. It adjusts the weight coefficients of audio and video according to the signal-to-noise ratio (SNR) in order to obtain the correct recognition result. Because the Dynamic Time Warping (DTW) algorithm is used for recognition, the system is effective only for a specific speaker and for isolated words.

Finally, the bimodal speech recognition system was designed and implemented, and extensive system tests were conducted. Experimental results indicate that the bimodal system outperforms a traditional single-modal speech recognition system in noisy environments, which suggests that the proposed method has research value.
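
The abstract does not give the fusion formula or the DTW variant, so the following is only a minimal sketch of how DTW template matching and an SNR-dependent weighted decision might fit together. The linear SNR-to-weight mapping, the 0 dB and 30 dB limits, and the names `dtw_distance`, `audio_weight`, and `recognize` are all assumptions made for illustration and are not taken from the thesis.

```python
import math


def dtw_distance(a, b):
    """Classic DTW distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]


def audio_weight(snr_db, low=0.0, high=30.0):
    """Hypothetical mapping: audio weight rises linearly from 0 to 1 with SNR."""
    return min(1.0, max(0.0, (snr_db - low) / (high - low)))


def recognize(audio_feats, video_feats, templates, snr_db):
    """Return the template word with the lowest weighted DTW distance.

    templates maps word -> (audio_template, video_template).
    Assumes audio and video distances are on comparable scales; a real
    system would likely normalize them before combining.
    """
    w = audio_weight(snr_db)
    scores = {}
    for word, (a_tmpl, v_tmpl) in templates.items():
        scores[word] = (w * dtw_distance(audio_feats, a_tmpl)
                        + (1.0 - w) * dtw_distance(video_feats, v_tmpl))
    return min(scores, key=scores.get)
```

In this sketch a low SNR shifts the decision toward the lip (video) features and a high SNR toward the audio features, which mirrors the weight adjustment described above. Matching against stored per-word templates is also why a DTW-based recognizer of this kind is limited to a specific speaker and isolated words.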
Keywords/Search Tags: bimodal speech recognition, lip-reading, feature extraction, fusion of audio and video, Dynamic Time Warping (DTW)