Bimodal Speech Recognition Technology Research Based On Audio And Video

Posted on:2015-01-15

Degree:Master

Type:Thesis

Country:China

Candidate:N Zhou

Full Text:PDF

GTID:2268330428472596

Subject:Signal and Information Processing

Abstract/Summary:


As computers become more intelligent, the demands on human-computer interaction grow, and traditional speech recognition cannot meet the requirements of everyday applications. To compensate for the poor robustness of traditional speech recognition in strongly noisy environments, this thesis proposes and designs a bimodal speech recognition system based on audio and video. Exploiting the bimodal nature of human speech perception, the system integrates visual lip-motion information into speech recognition and obtains good results.

The research is divided into three parts: audio signal processing, video signal processing, and the fusion and recognition of audio and video. In the audio part, the traditional dual-threshold speech endpoint detection method based on short-time energy and short-time zero-crossing rate is improved so that it performs better in strong noise, and Mel-frequency cepstral coefficients (MFCC) are then extracted as the audio feature parameters. In the video part, video frames are extracted dynamically and saved as images, and the mouth region is roughly located using the face detection facilities of OpenCV. The lip region is then separated from the skin region in the mouth image according to the difference between lip and skin in the Lab color space. Lip features, extracted by a hybrid algorithm that combines a geometric model with pixel information, serve as the video feature parameters. In the fusion and recognition part, a weight-based audio-video decision fusion algorithm is used: it adjusts the audio and video weight coefficients according to the signal-to-noise ratio (SNR) to produce the final recognition result.
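
The SNR-dependent weighting described above can be illustrated with a minimal sketch. The abstract does not give the actual weighting formula, so everything here is an assumption made for illustration: the linear SNR-to-weight mapping, the thresholds `snr_low` and `snr_high`, the inverse-distance scoring, and the function names `audio_weight`, `distances_to_scores` and `fuse`.

```python
def audio_weight(snr_db, snr_low=0.0, snr_high=30.0):
    """Map an SNR in dB to an audio weight in [0, 1].

    Assumed mapping (not from the thesis): linear between snr_low and
    snr_high, clamped outside that range.
    """
    if snr_high <= snr_low:
        raise ValueError("snr_high must be greater than snr_low")
    w = (snr_db - snr_low) / (snr_high - snr_low)
    return min(1.0, max(0.0, w))


def distances_to_scores(distances, eps=1e-9):
    """Turn per-word distances into scores that sum to 1.

    A smaller distance gives a larger score (inverse-distance weighting,
    chosen here only for simplicity).
    """
    inverse = {word: 1.0 / (d + eps) for word, d in distances.items()}
    total = sum(inverse.values())
    return {word: value / total for word, value in inverse.items()}


def fuse(audio_distances, video_distances, snr_db):
    """Combine audio and video scores with an SNR-dependent weight.

    Returns the best-scoring word and the fused score of every word.
    Both inputs must cover the same vocabulary.
    """
    if audio_distances.keys() != video_distances.keys():
        raise ValueError("audio and video vocabularies differ")
    w = audio_weight(snr_db)
    audio_scores = distances_to_scores(audio_distances)
    video_scores = distances_to_scores(video_distances)
    fused = {
        word: w * audio_scores[word] + (1.0 - w) * video_scores[word]
        for word in audio_scores
    }
    best = max(fused, key=fused.get)
    return best, fused


if __name__ == "__main__":
    # Hypothetical matching distances for a two-word vocabulary.
    audio = {"yes": 1.0, "no": 4.0}
    video = {"yes": 5.0, "no": 2.0}
    for snr in (30.0, 15.0, 0.0):
        print(snr, fuse(audio, video, snr)[0])
```

With this mapping the audio channel alone decides at high SNR and the video channel alone decides at low SNR; a real system would tune the mapping on test data.
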
Because recognition uses the Dynamic Time Warping (DTW) algorithm, the system is effective only for a specific speaker and for isolated words.

Finally, the bimodal speech recognition system was designed and implemented, and extensive system tests were carried out. Experimental results indicate that the bimodal system outperforms a traditional single-modal speech recognition system in noisy environments, so the proposed method has research value.
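
The DTW-based template matching mentioned in the abstract can be sketched as follows. This is the textbook DTW recurrence with a Euclidean local distance, not necessarily the exact variant used in the thesis; the function names `dtw_distance` and `recognize` and the toy feature sequences are assumptions for illustration.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Classic DTW distance between two sequences of feature vectors.

    Each sequence is a list of equal-length numeric vectors (for example,
    one MFCC vector per frame). The local cost is the Euclidean distance.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("sequences must not be empty")
    # cost[i][j] is the best accumulated cost aligning the first i frames
    # of seq_a with the first j frames of seq_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch seq_b
                cost[i][j - 1],      # stretch seq_a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def recognize(templates, query):
    """Return the label of the stored template closest to the query."""
    return min(templates, key=lambda label: dtw_distance(templates[label], query))


if __name__ == "__main__":
    # Toy one-dimensional "features"; real templates would hold MFCC
    # or lip-feature vectors recorded from a single speaker.
    templates = {
        "one": [[0.0], [1.0], [2.0]],
        "two": [[5.0], [5.0], [5.0]],
    }
    query = [[0.0], [0.0], [1.0], [2.0]]  # a time-stretched "one"
    print(recognize(templates, query))
```

Because each word is matched against templates recorded in advance, this kind of recognizer is naturally tied to the speaker who recorded them and to isolated words, which is the limitation the abstract notes.
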

Keywords/Search Tags:

bimodal speech recognition, lip-reading, feature extraction, fusion of audio and video, Dynamic Time Warping (DTW)


Related items

1	Based On The Design Of Small-vocabulary Speech Recognition System And Speech Recognition
2	Research On Technologies Of Audio-Visual Bimodal Speech Recognition Based On Attention Mechanism
3	A Study Of Speech Features Extraction And Matching Algorithm Under Noisy Conditions
4	A Study On Bimodal Audio Visual Speech Recognition Based On Deep Learning
5	Research Of The Characteristics Parameters Extraction In The Personal Of Speech Recognition
6	Speech Recognition Research Of Non-specific Person’s Isolated Words Based On DTW Model
7	Study On Speaker-Independent Isolated Words Speech Recognition System
8	The Design And Realization Of Speech Recognizing Isolated Word Based On MPU.A
9	Research On The Method Of Gait Feature Extraction And Recognition
10	Research On Expression And Speech Bimodal Emotion Recognition Of Children