
Research On Noise Treatment Of Speech Recognition With Lip-movement Information

Posted on: 2011-09-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X H Feng
Full Text: PDF
GTID: 1118330332472016
Subject: Communication and Information System
Abstract/Summary:
Audio-visual speech recognition (AVSR), also known as bimodal speech recognition, has become a promising way to significantly improve the robustness of automatic speech recognition (ASR). Motivated by the bimodal nature of human speech perception, work in this field aims to improve ASR by exploiting the visual modality of the speaker's mouth region in addition to the traditional audio modality. This thesis addresses several key issues of AVSR, namely lip contour extraction, visual feature extraction, and audio-visual fusion. The main contributions are as follows:

1) An audio-visual bimodal continuous speech database for vehicular voice control is collected. The database contains 26 speakers (14 male, 12 female), each uttering every continuous sentence 4 times. It comprises 68 sentences, derived from the conclusions of a survey.

2) An adaptive mouth region detection algorithm is presented based on multiple color spaces. The algorithm combines color edge detection in the RGB color space with threshold segmentation in the HSV color space. Furthermore, according to the position of the mouth within the face, an adaptive lip localization method is introduced that detects the position of the mouth baseline automatically; the rectangular mouth region is then detected by a projection method. Experimental results show that the proposed algorithm locates the mouth region quickly, accurately, and robustly, with a correct-detection rate of 98.25%, a 3.37% improvement in accuracy over Principal Component Analysis (PCA).

3) To increase the accuracy and speed of lip contour extraction, an improved Geometric Active Contours (GAC) model based on a Prior Shape (PS) and multi-directional gradient information is proposed for lip contour detection. The multi-directional gradient information and the lip prior shape are incorporated into the energy function of the level set. The improved model thereby avoids the lip contour extraction failures of the traditional GAC model.
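The projection step of the mouth localization in contribution 2 can be sketched roughly as follows. This is a minimal illustration, not the thesis algorithm: the `red_margin` colour test is a hypothetical stand-in for the RGB edge detection and HSV threshold segmentation described above, while the row/column projection mirrors the projection method used to obtain the rectangular mouth region.

```python
import numpy as np

def locate_mouth(img, red_margin=0.25):
    """Rectangular mouth-region localization by colour masking + projection.

    img: RGB array of floats in [0, 1], shape (H, W, 3).
    red_margin is a hypothetical threshold standing in for the HSV
    segmentation used in the thesis.
    """
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    mask = (r - np.maximum(g, b)) > red_margin   # lip-coloured pixels
    rows = mask.sum(axis=1)                      # horizontal projection
    cols = mask.sum(axis=0)                      # vertical projection
    ys = np.flatnonzero(rows)
    xs = np.flatnonzero(cols)
    if ys.size == 0 or xs.size == 0:
        return None                              # no lip-coloured region found
    return ys[0], ys[-1], xs[0], xs[-1]          # top, bottom, left, right
```

The projections reduce the 2-D mask to two 1-D profiles, so the bounding rectangle is found in a single pass over rows and columns.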
Experimental results show that lip contour detection with the PS-level set model improves accuracy by 8.38% over the GAC model.

4) A dynamic visual feature extraction method based on frame distance and Linear Discriminant Analysis (LDA) is also proposed. The proposed feature not only captures important lip motion information but also embodies a priori speech classification information. Evaluation experiments demonstrate that augmenting the static feature with frame distance significantly improves recognition by 3.25% for DTCWT features, and augmenting it with LDA improves recognition by 6.50%. With further delta and delta-delta augmentation, the recognition rate improves by 9.44% and 14.43% respectively. The final dynamic feature improves recognition accuracy by 20.12% compared to the static feature.

5) A bimodal training model is proposed to improve the recognition rate of audio-visual feature fusion. Considering both the effect of noise arising from the mismatch between training and testing data and the recognition speed, a noisy training model and a basic training model are used together to perform audio-visual feature-fusion speech recognition. Two audio-visual speech databases, the English AMP-AVSp and the Mandarin BiMoSp, are used in the experiments. Results show that the bimodal training model improves recognition accuracy on both databases; for example, at SNR = -5 dB in the testing data, the improvements for the two databases are 45.27% and 37.24% respectively.

6) A new weight estimation method based on Integer Linear Programming (ILP) is developed to estimate the optimal exponent weights for combining the audio (speech) and visual (mouth) streams in audio-visual decision fusion. The ILP model is built from the log-likelihood linear combination of the two streams and the Maximum Log-Likelihood Distance (MLLD) rule. In the experiments, exhaustive search (ES) and frame dispersion (FD) of hypotheses are used as comparison methods.
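The delta and delta-delta augmentation in contribution 4 can be illustrated with the common HTK-style regression over a window of neighbouring frames; the thesis' exact frame-distance definition may differ, so this is a sketch of the standard formulation rather than the method used in the experiments.

```python
import numpy as np

def delta(feat, N=2):
    """Delta (dynamic) coefficients over a +/-N frame window.

    feat: (T, D) static visual feature sequence.
    The delta-delta augmentation is simply delta(delta(feat)).
    Window size N=2 is the conventional default, assumed here.
    """
    T, _ = feat.shape
    pad = np.pad(feat, ((N, N), (0, 0)), mode="edge")   # repeat edge frames
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = sum(n * (pad[N + n:N + n + T] - pad[N - n:N - n + T])
              for n in range(1, N + 1))
    return out / denom
```

Stacking `feat`, `delta(feat)`, and `delta(delta(feat))` column-wise yields the augmented dynamic feature vector whose gains over the static feature are reported above.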
The ILP results are similar to those of the ES model and superior to those of the FD model. Since ES is known to find the optimal result, this indicates that ILP can likewise obtain the optimal stream weighting for audio-visual decision-fusion speech recognition.
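The exhaustive-search (ES) baseline against which the ILP model is compared can be sketched as a grid search over the exponent weight in the log-likelihood linear combination. This is an illustrative sketch only: the selection criterion below is plain classification accuracy, not the MLLD rule of the thesis, and the function and parameter names are assumptions.

```python
import numpy as np

def best_stream_weight(audio_ll, visual_ll, labels, grid=None):
    """Exhaustive search (ES) for the stream exponent weight lambda.

    Scores are combined as  lambda * audio_ll + (1 - lambda) * visual_ll,
    the log-likelihood linear combination of the two streams.
    audio_ll, visual_ll: (N, K) per-class log-likelihoods; labels: (N,).
    Accuracy is used as the selection criterion here for illustration.
    """
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)   # assumed search resolution
    best_lam, best_acc = 0.0, -1.0
    for lam in grid:
        pred = np.argmax(lam * audio_ll + (1.0 - lam) * visual_ll, axis=1)
        acc = float(np.mean(pred == labels))
        if acc > best_acc:
            best_lam, best_acc = float(lam), acc
    return best_lam, best_acc
```

ES guarantees the best weight on the grid but costs one full evaluation per candidate; the appeal of the ILP formulation is reaching a comparable optimum without that exhaustive sweep.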
Keywords/Search Tags: Audio-visual speech recognition, lip movement, contour extraction, dynamic feature, audio-visual fusion