
Research on Dynamic Bayesian Network Models for Audio-Visual Speech Recognition

Posted on: 2008-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: A L Sun
Full Text: PDF
GTID: 2178360212978890
Subject: Computer Science and Technology
Abstract/Summary:
The Dynamic Bayesian Network (DBN) is used in speech recognition because of its extensibility and its powerful descriptive, inference, and learning abilities for time series. In this thesis, the author designs a single-stream DBN model for audio or visual speech recognition and phoneme (or viseme) segmentation. The work is outlined as follows.
First, the author investigates a continuous speech recognition system based on the Hidden Markov Model (HMM), including embedded training and recognition. A connected-digit audio-visual database has been recorded. For the audio stream, Mel-Frequency Cepstral Coefficient (MFCC) features are extracted. For the video stream, three kinds of lip features are extracted: 1) static geometric features, 2) static plus delta (dynamic) geometric features, and 3) linearly interpolated geometric features based on the static and dynamic features. Audio experiments show that the triphone HMM achieves higher word recognition rates than the monophone HMM. Video experiments show that the third kind of lip feature achieves higher word recognition rates than the other two.
Second, the author studies the basic principles of the DBN: its topology, the probabilistic inference formulas, tree inference, frontier inference, and the junction tree algorithm. The study shows that the DBN is more general, explicit, and extensible than the HMM.
Third, the author studies and improves the Word-State DBN (WS-DBN) model, designs an acoustic speech model based on the Word-Phone DBN (WP-DBN) model and a visual speech model based on the Word-Viseme DBN (WV-DBN) model, and implements the WS-DBN and WV-DBN systems with the Graphical Model Toolkit (GMTK). The WP-DBN and WV-DBN models emulate the word-phone (or word-viseme) structure, expose the transition probabilities between phones (or visemes), and can output the phone (or viseme) segmentation with timing boundaries.
Finally, the author defines evaluation criteria: word recognition rate, word recognition accuracy, and phone (or viseme) segmentation score. The recognition and segmentation performance of the WS-DBN model, WP-DBN model, WV-DBN model, monophone HMM, triphone HMM, and monoviseme HMM is compared in different noisy environments. Audio experiments show that the WP-DBN model 1) achieves almost the same recognition rates as the triphone HMM on clean speech, and 2) is more robust to noisy environments than the HMM. Video...
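The abstract names word recognition rate and word recognition accuracy as evaluation criteria but does not state their formulas. The sketch below assumes the convention common in HMM-based toolkits: with N reference words, S substitutions, D deletions, and I insertions, the recognition rate is (N - D - S) / N and the accuracy is (N - D - S - I) / N. The function names and the unit-cost alignment are illustrative assumptions, not the thesis's actual scoring procedure, and the segmentation score is left out because the abstract does not define it.

```python
def align_counts(ref, hyp):
    """Count hits, substitutions, deletions, insertions.

    Uses a unit-cost Levenshtein alignment between the reference and
    hypothesis word sequences (an assumption; real scoring tools may
    weight the error types differently).
    """
    n, m = len(ref), len(hyp)
    # table[i][j] = (errors, hits, subs, dels, ins) for ref[:i] vs hyp[:j]
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, n + 1):
        e, h, s, d, k = table[i - 1][0]
        table[i][0] = (e + 1, h, s, d + 1, k)      # only deletions
    for j in range(1, m + 1):
        e, h, s, d, k = table[0][j - 1]
        table[0][j] = (e + 1, h, s, d, k + 1)      # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, h, s, d, k = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, h + 1, s, d, k)         # correct word
            else:
                diag = (e + 1, h, s + 1, d, k)     # substitution
            e, h, s, d, k = table[i - 1][j]
            deletion = (e + 1, h, s, d + 1, k)
            e, h, s, d, k = table[i][j - 1]
            insertion = (e + 1, h, s, d, k + 1)
            # Keep the cheapest path; ties favour the diagonal move.
            table[i][j] = min(diag, deletion, insertion, key=lambda t: t[0])
    return table[n][m][1:]


def recognition_scores(ref, hyp):
    """Return word recognition rate and accuracy in percent (assumed formulas)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    hits, subs, dels, ins = align_counts(ref, hyp)
    n = len(ref)
    return {
        "rate": 100.0 * (n - dels - subs) / n,
        "accuracy": 100.0 * (n - dels - subs - ins) / n,
    }


if __name__ == "__main__":
    reference = "one two three four".split()
    hypothesis = "one three three four five".split()
    print(recognition_scores(reference, hypothesis))
```

Under these assumed definitions, insertions lower the accuracy but not the recognition rate, so the two figures differ whenever the recognizer outputs extra words.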
Keywords/Search Tags: Dynamic Bayesian Network (DBN), Graphical Model Toolkit (GMTK), Word-Phone DBN (WP-DBN), Word-Viseme DBN (WV-DBN)