
Audio-Visual Speech Recognition And Its Applications

Posted on: 2021-05-19    Degree: Doctor    Type: Dissertation
Country: China    Candidate: R F Su    Full Text: PDF
GTID: 1368330623465075    Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Speech signals are easily corrupted in adverse acoustic environments by undesired components such as ambient noise, far-field reverberation and narrowband channel distortion. As a result, the performance of traditional automatic speech recognition (ASR) systems that rely on acoustic inputs alone tends to degrade greatly, which limits their practical application. One solution is to augment ASR systems with visual information, yielding audio-visual speech recognition (AVSR) systems. The use of visual information is motivated by multimodal speech perception in humans: in speech communication, the brain perceives speech by integrating acoustic cues with visual cues such as lip motion. Previous studies have reported that visual features provide complementary information that improves speech perception, especially in adverse acoustic environments where the acoustic signal alone may fail to convey everything needed for understanding. Accordingly, AVSR systems have demonstrated superior performance over traditional ASR systems across a wide range of applications and have attracted growing attention.

Current AVSR systems face two key issues. First, constructing robust AVSR systems requires a large amount of audio-visual parallel data, which is far more expensive to acquire than audio-only data. Second, traditional AVSR systems require both audio and visual data at test time, whereas in many practical applications only audio data are available and the visual information is missing. Together these issues pose a major obstacle to the wider use of AVSR systems.

To address these issues, this dissertation proposes a novel AVSR framework in which visual features are automatically and reliably generated from real-world audio data and then used to construct AVSR systems that are more robust in adverse acoustic environments. The major contributions are as follows:

(1) A convolutional neural network (CNN) based audio-visual integration approach was proposed. Separate convolutional sub-networks are first used for acoustic and visual modelling, so that audio and visual information can propagate on their own time scales within each stream, and high-level audio-visual representations are obtained from these sub-networks. A shared fully connected network then operates on these representations and can better exploit the mutual long-term dependence between the audio and visual data. Experimental results showed that, compared with traditional audio-visual integration approaches, the CNN based approach yielded a significant relative error rate reduction of about 15%. The approach can model the independence, asynchrony and long-term dependence between audio and visual data, and provides a basis for further research on deep learning based audio-visual integration.

(2) A novel bimodal modelling approach based on visual feature generation was proposed. Acoustic-to-visual inversion models built on bi-directional long short-term memory (BLSTM) recurrent neural networks are first trained on a limited amount of audio-visual parallel data; visual features are then generated automatically from the acoustic inputs, and robust AVSR systems are finally built with these generated features, as sketched below.
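The following is a minimal sketch of such an acoustic-to-visual inversion model, written in PyTorch for illustration; the feature dimensions, layer sizes and the mean-squared-error objective are assumptions, not the dissertation's actual configuration.

    # Hedged sketch of a BLSTM acoustic-to-visual inversion model.
    # Dimensions and the MSE objective are illustrative assumptions.
    import torch
    import torch.nn as nn

    class AudioToVisualInversion(nn.Module):
        def __init__(self, audio_dim=40, visual_dim=30, hidden=256, layers=2):
            super().__init__()
            # Bi-directional LSTM over the acoustic feature sequence
            self.blstm = nn.LSTM(audio_dim, hidden, num_layers=layers,
                                 batch_first=True, bidirectional=True)
            # Frame-wise regression from BLSTM states to visual features
            self.proj = nn.Linear(2 * hidden, visual_dim)

        def forward(self, audio_feats):            # (batch, frames, audio_dim)
            states, _ = self.blstm(audio_feats)    # (batch, frames, 2*hidden)
            return self.proj(states)               # (batch, frames, visual_dim)

    # Training on limited audio-visual parallel data with a frame-wise MSE loss
    model = AudioToVisualInversion()
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    audio = torch.randn(8, 100, 40)    # dummy acoustic features (e.g. filterbanks)
    visual = torch.randn(8, 100, 30)   # dummy lip-region visual features
    loss = criterion(model(audio), visual)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The bidirectional recurrence lets each generated visual frame draw on both past and future acoustic context, which is what makes frame-wise inversion from audio alone plausible.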
Experimental results showed that, when the training and evaluation data came from the same acoustic environment, only a small amount of audio-visual parallel data was needed to build a competitive AVSR system with the proposed visual feature generation approach, which outperformed a comparable ASR baseline by a relative error rate reduction of about 11%. This approach effectively addresses the missing-visual-information problem of AVSR systems in practical use.

(3) A cross-domain adaptation method based on a multi-level adaptive deep neural network was proposed. The acoustic domain mismatch between real-world audio-only data and the audio-visual parallel data can lead to unreliably generated visual features, which in turn degrades AVSR performance. To address this, a multi-level adaptive deep neural network is used to extract feature representations that capture the inherent characteristics of the real-world acoustic environment. These features serve as additional inputs to the inversion models, reducing the acoustic domain mismatch so that reliable visual features can be generated from widely available audio-only data (see the sketch following this abstract). Experimental results showed that, because of the domain mismatch, applying the visual feature generation approach without cross-domain adaptation brought no performance improvement, whereas the AVSR system trained with the combination of cross-domain adaptation and visual feature generation significantly outperformed the baseline system by a relative error rate reduction of over 10%.

The proposed framework represents a first attempt of its kind in AVSR technology development. It reduces the dependence on large amounts of audio-visual training data for constructing AVSR systems, and it allows AVSR systems to be used when only audio data are available at test time, enabling wider application of AVSR in the real world.
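As an illustration of the auxiliary-input scheme in contribution (3), the following hedged sketch concatenates domain-characterising bottleneck features with the acoustic features before the BLSTM inversion network. The stand-in bottleneck extractor and all dimensions are assumptions made for illustration; they do not reproduce the dissertation's multi-level adaptive network.

    # Hedged sketch: domain features as auxiliary inputs to the inversion model.
    import torch
    import torch.nn as nn

    class BottleneckExtractor(nn.Module):
        """Maps real-world acoustic frames to low-dimensional domain features
        (a stand-in for the multi-level adaptive deep neural network)."""
        def __init__(self, audio_dim=40, bottleneck_dim=16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(audio_dim, 128), nn.ReLU(),
                nn.Linear(128, bottleneck_dim))

        def forward(self, audio_feats):        # (batch, frames, audio_dim)
            return self.net(audio_feats)       # (batch, frames, bottleneck_dim)

    class AdaptedInversion(nn.Module):
        """BLSTM inversion model fed with [acoustic ; domain] features."""
        def __init__(self, audio_dim=40, bottleneck_dim=16, visual_dim=30, hidden=256):
            super().__init__()
            self.blstm = nn.LSTM(audio_dim + bottleneck_dim, hidden,
                                 batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, visual_dim)

        def forward(self, audio_feats, domain_feats):
            x = torch.cat([audio_feats, domain_feats], dim=-1)
            states, _ = self.blstm(x)
            return self.proj(states)

    extractor, inversion = BottleneckExtractor(), AdaptedInversion()
    audio = torch.randn(4, 100, 40)                  # real-world audio-only data
    visual_hat = inversion(audio, extractor(audio))  # generated visual features

Concatenating the domain features at the input is one simple way to expose the environment characteristics to the inversion model; the dissertation's multi-level adaptive network may combine this information differently.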
Keywords/Search Tags:Audio-Visual Speech Recognition, Visual Feature Generation, Cross-Domain Adaptation