
Construction Of Visual Speech Representation For Bi-modal Based Speech Recognition

Posted on: 2014-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Han
Full Text: PDF
GTID: 2268330392473727
Subject: Computer Science and Technology

Abstract/Summary:
In the field of human-computer interaction, traditional audio-only speech recognition can achieve a comparatively high recognition rate for consecutive words and phrases in a relatively quiet environment. When applied in real environments, however, its recognition ability is greatly hampered by factors such as background noise. Human language cognition is a multi-channel perceptual process: in communication, people understand speech not only through sound but also by observing mouth movements and facial expressions. Visual information is therefore an important source that can significantly improve the speech recognition rate. This thesis mainly explores effective visual speech features and a coupled hidden Markov model (CHMM) based dual-channel recognition algorithm for an audio-visual speech recognition system.

The thesis first introduces the research background and significance of visual speech, surveys the present state of visual speech research, and analyzes the mainstream methods. On this basis, it presents the framework of our audio-visual Chinese character recognition system, which consists of four parts: a data acquisition and pre-processing module, a feature extraction module, an analysis and integration module, and an identification module. Among these, the detection of the lip area, the extraction of visual speech features, and an effective fusion method for audio and video information are the key points for improving the recognition capability of the whole system, and they are the focus of this study. The main work of this thesis is as follows:

1) Lip-shape-class-based AAM templates.
Accurate feature point location directly affects the accuracy of the subsequent computation of geometric features and the location of the inner lip region. Because of the varied texture of the open and closed mouth states and the overlap of the inner and outer lip corner points, automatic point location with a single AAM requires a large amount of manual adjustment afterwards. This thesis therefore proposes classifying the samples by mouth shape and training AAM templates separately. We define three typical AAM templates: ① a closed-mouth AAM template; ② an "O"-type AAM template; ③ an ordinary AAM template. This improves point location results and reduces the manual work.

2) Inner lip texture features. Feature extraction plays an important role in speech recognition: whether a feature can well reflect the intrinsic characteristics of the objects and make them distinguishable directly affects the recognition accuracy rate. Guided by the rules of Chinese pronunciation, especially the state and shape of the teeth and tongue during pronunciation, the thesis proposes several inner lip texture models based on statistical characteristics, including the histogram of the inner lip region, the sub-block histograms of the upper and lower halves of the inner lip region, the proportion of teeth visibility, the average pixel value of inner lip sub-blocks, color moments, and the discrete cosine transform. All of these models are analyzed, and a selection of them is used as the basis of our final feature.

3) Visual feature analysis. To verify the representation capability of each inner lip texture model, and its discrimination performance in the different components of the RGB, HSV, Lab, and YCrCb color spaces, this thesis carried out feature screening and analysis experiments using several supervised classification algorithms. Static visual speech images, represented by different features and feature combinations over various color components, were classified.
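The statistical texture models listed above can be sketched as follows. This is a minimal illustrative implementation, not the thesis's exact definitions: the bin counts, the brightness threshold used as a stand-in for teeth visibility, the 2×2 sub-block grid, and the number of retained DCT coefficients are all assumed parameters.

```python
import numpy as np

def dct2(block):
    """2-D DCT-II of a square block via an orthonormal transform matrix."""
    n = block.shape[0]
    k = np.arange(n)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)
    return C @ block @ C.T

def inner_lip_texture_features(region, bins=16, teeth_thresh=180, dct_keep=4):
    """Statistical texture models for a grayscale inner-lip crop.

    Concatenates: whole-region histogram; upper/lower half histograms;
    teeth-visibility proportion (bright-pixel fraction, an assumed proxy);
    mean pixel value of each sub-block in a 2x2 grid; color moments
    (mean, std, skewness); and low-frequency DCT coefficients.
    """
    region = region.astype(np.float64)
    h, w = region.shape

    hist, _ = np.histogram(region, bins=bins, range=(0, 256), density=True)
    upper, _ = np.histogram(region[: h // 2], bins=bins, range=(0, 256), density=True)
    lower, _ = np.histogram(region[h // 2 :], bins=bins, range=(0, 256), density=True)

    teeth = np.mean(region > teeth_thresh)  # proportion of teeth visibility

    # average pixel value of each sub-block in a 2x2 grid
    blocks = [region[i * h // 2:(i + 1) * h // 2, j * w // 2:(j + 1) * w // 2].mean()
              for i in range(2) for j in range(2)]

    mu, sigma = region.mean(), region.std()
    skew = np.mean(((region - mu) / (sigma + 1e-8)) ** 3)  # third color moment

    # crop to a square for the separable DCT, keep the low-frequency corner
    n = min(h, w)
    coeffs = dct2(region[:n, :n])[:dct_keep, :dct_keep].ravel()

    return np.concatenate([hist, upper, lower, [teeth], blocks, [mu, sigma, skew], coeffs])

# Hypothetical usage on a random stand-in for an inner-lip crop:
rng = np.random.default_rng(0)
crop = rng.integers(0, 256, size=(32, 40), dtype=np.uint8)
feat = inner_lip_texture_features(crop)
```

In the thesis these descriptors are computed per color component (e.g. the H channel of HSV) rather than on a single grayscale plane; the sketch uses one channel for brevity.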
Based on the classification accuracy rates, the color space and components are selected and the inner lip feature is determined. Finally, the visual speech feature, combining the selected lip texture models with the geometric features of the outer lip, is generated after dimensionality reduction and normalization.

4) Based on a self-captured bimodal speech dataset, a CHMM-based bimodal Chinese character pronunciation recognition system was realized. The main work is as follows. First, the model structure is simplified by limiting the number of states of each information flow and the degree of asynchrony between the flows; the CHMM algorithm is then completed through an equivalent transformation to a traditional HMM. The resulting CHMM mid-term integration based audio-visual speech recognition system retains the independence of the audio and video streams while also modeling their asynchrony in time. Second, through single-channel and dual-channel comparative experiments based on HMM and CHMM, we on the one hand further validate the representational ability of each single-channel feature model, and on the other hand verify that bimodal audio-visual recognition achieves better recognition results.
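The equivalent transformation described above — decoding a two-chain CHMM as a single HMM over product states, where each chain's transition depends on the joint previous state — can be sketched as follows. This is a schematic illustration with assumed model sizes and randomly generated parameters, not the thesis's trained system.

```python
import numpy as np

def chmm_viterbi(A_a, A_v, logB_a, logB_v, pi_a, pi_v):
    """Viterbi decoding for a two-chain coupled HMM via product states.

    A_a[i, j, i2] = P(next audio state i2 | audio i, video j), and
    A_v[i, j, j2] likewise for the video chain, so each chain's transition
    is conditioned on the joint previous state (the coupling). Each chain
    emits its own observations: logB_a[t, i], logB_v[t, j].
    Returns the decoded state paths of the audio and video chains.
    """
    Na, Nv = pi_a.shape[0], pi_v.shape[0]
    T = logB_a.shape[0]

    # log-transition over product states (i, j) -> (i2, j2), factorized per chain
    logA = (np.log(A_a)[:, :, :, None] + np.log(A_v)[:, :, None, :])
    logA = logA.reshape(Na * Nv, Na * Nv)

    logB = (logB_a[:, :, None] + logB_v[:, None, :]).reshape(T, Na * Nv)
    logpi = (np.log(pi_a)[:, None] + np.log(pi_v)[None, :]).ravel()

    delta = logpi + logB[0]
    back = np.zeros((T, Na * Nv), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA     # scores[s, s2]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]

    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path // Nv, path % Nv           # audio chain, video chain

# Hypothetical usage with random, row-normalized parameters:
rng = np.random.default_rng(1)
Na, Nv, T = 3, 3, 8
A_a = rng.random((Na, Nv, Na)); A_a /= A_a.sum(-1, keepdims=True)
A_v = rng.random((Na, Nv, Nv)); A_v /= A_v.sum(-1, keepdims=True)
pi_a = np.full(Na, 1 / Na); pi_v = np.full(Nv, 1 / Nv)
logB_a = np.log(rng.random((T, Na)))
logB_v = np.log(rng.random((T, Nv)))
qa, qv = chmm_viterbi(A_a, A_v, logB_a, logB_v, pi_a, pi_v)
```

Because the per-chain emissions are kept separate while the transitions are coupled, the product-state view preserves the independence of the audio and video observation streams yet allows their states to drift apart in time, which is the asynchrony property the abstract attributes to CHMM mid-term integration.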
Keywords/Search Tags:Inner lip texture, Color space, HMM, Bimodal, CHMM