
Research On Audio And Video Speech Recognition In Tibetan Lhasa Dialect

Posted on: 2022-10-07
Degree: Master
Type: Thesis
Country: China
Candidate: F Gao
GTID: 2518306332477504
Subject: Computer Science and Technology
Abstract/Summary:
In quiet environments, recent state-of-the-art automatic speech recognition systems can achieve more than 95% accuracy, but many problems remain in real-life environments, such as interference from environmental noise and the low signal-to-noise ratio of audio collected by far-field microphones, so the audio signal alone often cannot meet the requirements of speech recognition and needs to be supplemented by signals from other modalities. Compared with the audio signal, visual information is not easily affected by background noise; in the process of speaking, the organs of the face change correspondingly, which complements the audio information. According to existing work, multimodal speech recognition has been studied for mainstream languages such as English and Chinese, but there is almost no research on Tibetan multimodal speech recognition. Therefore, this thesis studies multimodal speech recognition for the Tibetan Lhasa dialect and its application. The main work of this thesis is as follows.

1. Construction of a Tibetan audio-video data set. To speed up research on Tibetan Lhasa multimodal speech recognition, this thesis constructs and releases a Tibetan Lhasa audio-video data set. Compared with common audio-video data sets such as TCD-TIMIT, it features more complex recording environments and a wider variety of scenes, which makes it closer to real-life application scenarios.

2. An end-to-end Tibetan multimodal speech recognition baseline model. The baseline uses the WaveNet-CTC model. Based on the characteristics of end-to-end speech recognition technology and of the Tibetan language, the Tibetan single syllable is selected as the recognition unit. In the baseline experiments, the audio information, the visual information, and the concatenation of the two were fed into the WaveNet-CTC model, respectively. The results show that concatenating audio and visual information does not improve recognition accuracy on our self-built Tibetan Lhasa dialect data set. A possible reason is that the speakers' head postures and facial expressions differ considerably across the Tibetan videos, which hinders the extraction of lip motion features; this indicates that simple concatenation of audio and video features has certain limitations.

3. Cross-attention based end-to-end Tibetan multimodal speech recognition. To overcome the limitations of concatenated audio and video features and make better use of the video modality, this thesis proposes a cross-attention mechanism and applies it in the WaveNet-CTC based baseline for Tibetan multimodal speech recognition (a minimal sketch of this kind of fusion follows this list). The experimental results show that, compared with the baseline model, introducing the attention mechanism in the primary fusion stage of audio and visual features improves speech recognition performance.

4. Latent regression Bayesian network based end-to-end Tibetan multimodal speech recognition. To address the limitations of concatenated audio and video features, this thesis explores not only multimodal feature fusion methods but also alternative representations of the input data. It introduces a latent regression Bayesian network at the input layer of the end-to-end model to learn data representations, extracting hidden features from the spectrogram of the audio stream and the raw images of the video stream as replacements for the MFCCs and the lip motion features. However, according to the experimental results, the hidden features extracted by the latent regression Bayesian network do not achieve better speech recognition performance than the manually selected features.

5. A WeChat applet implementation of the Tibetan audio-video speech recognition system. We develop the Tibetan multimodal speech recognition system as a WeChat applet, using the TensorFlow deep learning framework and the Tomcat server. The system obtains video data through the WeChat applet and sends the recognition results back to the applet.
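The abstract gives no implementation details for the cross-attention fusion in item 3, so the following is only a minimal sketch of the idea, written in TensorFlow (the framework the system in item 5 is built with). The class name CrossModalFusion, all dimensions, and the use of tf.keras.layers.MultiHeadAttention are illustrative assumptions, not the thesis's actual architecture:

```python
# Minimal sketch of cross-attention fusion of an audio feature sequence
# with a visual feature sequence, as in work item 3. Shapes and names are
# hypothetical; the thesis builds on WaveNet-CTC, not reproduced here.
import tensorflow as tf

class CrossModalFusion(tf.keras.layers.Layer):
    """Audio frames attend to the visual sequence before the fused
    features are passed to the acoustic model (e.g. WaveNet with CTC)."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.audio_proj = tf.keras.layers.Dense(dim)   # project MFCC frames
        self.video_proj = tf.keras.layers.Dense(dim)   # project lip features
        self.cross_attn = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=dim // num_heads)
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, audio_feats, video_feats):
        # audio_feats: (batch, T_audio, F_audio), e.g. MFCC frames
        # video_feats: (batch, T_video, F_video), e.g. lip-region features
        q = self.audio_proj(audio_feats)
        kv = self.video_proj(video_feats)
        # Each audio frame queries the visual sequence, so the two streams
        # need not be frame-aligned, unlike plain concatenation.
        attended = self.cross_attn(query=q, value=kv, key=kv)
        return self.norm(q + attended)  # residual fusion

# Usage with hypothetical dimensions: 39-dim MFCCs at 100 audio frames,
# 128-dim lip features at 25 video frames.
fusion = CrossModalFusion()
audio = tf.random.normal([2, 100, 39])
video = tf.random.normal([2, 25, 128])
fused = fusion(audio, video)  # -> (2, 100, 256)
```

A design of this kind sidesteps the frame-rate mismatch between audio and video that makes direct concatenation brittle, which is consistent with the limitation the baseline experiments report.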
Through this research on Tibetan audio-video speech recognition, this thesis not only complements existing work on Tibetan multimodal speech recognition, but also proposes a cross-attention mechanism for the construction of the multimodal recognition model, which fuses multimodal features effectively and overcomes the limitations of concatenated features to a certain extent. In addition, the thesis explores the use of a latent regression Bayesian network at the input layer of the end-to-end model to learn hidden features from the raw audio and video data in place of the manually extracted MFCC and lip motion features, as a further attempt to avoid the limitations of concatenated features.
Keywords/Search Tags: Tibetan multimodal speech recognition, end-to-end learning, cross-modal attention mechanism, latent regression Bayesian network