
Knowledge Distillation For Speech-assisted Lip Reading

Posted on: 2021-01-13
Degree: Master
Type: Thesis
Country: China
Candidate: R Xu
Full Text: PDF
GTID: 2428330623969205
Subject: Computer Science and Technology
Abstract/Summary:
Lip reading has a wide range of applications in daily life, such as assisting speech recognition in noisy acoustic environments, facilitating communication for people with disabilities, and generating subtitles for black-and-white silent films. In recent years, with the rapid development of deep learning, many important breakthroughs have been made in lip reading. At the same time, several difficulties and challenges remain. For example, compared with tasks such as image classification and neural machine translation, the number of training samples for lip reading is small, and lip movements carry inherent visual ambiguity, which makes it harder for a model to extract discriminative features.

To address these issues, this thesis proposes knowledge distillation methods for speech-assisted lip reading to further improve model performance. The analysis rests on three observations. First, video and audio signals are correlated: for the same text sequence, the two modalities carry the same information. Second, existing automatic speech recognition datasets are large, and models trained on them perform well. Third, knowledge distillation can transfer the knowledge learned by a teacher model to a student model. Accordingly, this thesis proposes three knowledge distillation methods at different granularity levels in the output space. At the character level, the mismatch in length between the sequence decoded by the speech recognition model and the ground-truth target sequence is effectively alleviated by aligning the two with a longest-common-subsequence procedure (sketched below). At the sequence level, the beam search decoding results are used to transfer the contextual knowledge learned by the speech recognition model to the lip-reading model. At the character-sequence hybrid level, the beam search results of the speech recognition model are combined with the ground-truth target sequence to assist the training of the lip-reading model.

In addition, this thesis observes that, for the same text sequence, the video and audio signals provide complementary information. It therefore proposes a knowledge distillation method that uses a trained automatic speech recognition model to strengthen the lip-reading model in the feature space. In particular, this thesis analyzes the limitations of the existing LIBS method and proposes a corresponding optimization: a video-feature-level knowledge distillation loss is added, strengthening the constraint on the video feature extraction module. An alignment method similar to the attention mechanism resolves the unequal lengths of the video and audio signals and establishes the correspondence between them (also sketched below).

Experiments on the English lip-reading dataset LRS2-BBC verify the effectiveness of the proposed speech-assisted knowledge distillation methods. Compared with the baseline method (WAS), the proposed models achieve improved performance.
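The character-level distillation described above can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes a PyTorch setup in which the teacher (speech recognition) model yields per-character logits over its own decoded sequence, the student (lip-reading) model yields logits over the ground-truth target length, and the distillation loss is applied only at positions matched by a longest-common-subsequence alignment. The names lcs_alignment and char_level_kd_loss and the temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def lcs_alignment(teacher_chars, target_chars):
    """Longest-common-subsequence DP; returns (i, j) pairs of matched
    positions in the teacher output and the ground-truth target."""
    m, n = len(teacher_chars), len(target_chars)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if teacher_chars[i] == target_chars[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if teacher_chars[i - 1] == target_chars[j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def char_level_kd_loss(student_logits, teacher_logits, teacher_chars, target_chars, T=2.0):
    """KL distillation applied only at LCS-aligned character positions.
    student_logits: (target_len, vocab); teacher_logits: (teacher_len, vocab)."""
    pairs = lcs_alignment(teacher_chars, target_chars)
    if not pairs:
        return student_logits.new_zeros(())
    t_idx = torch.tensor([p[0] for p in pairs])
    s_idx = torch.tensor([p[1] for p in pairs])
    p_teacher = F.softmax(teacher_logits[t_idx] / T, dim=-1)
    log_p_student = F.log_softmax(student_logits[s_idx] / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```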
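The feature-space distillation with an attention-style alignment can be sketched in the same spirit. This is an illustration under assumed shapes, not the LIBS formulation or the thesis implementation: each video (student) frame attends over the audio (teacher) features to obtain an audio summary of matching length, and the distance between the two feature sequences is penalised. The function name and the mean-squared-error distance are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_aligned_feature_kd(video_feats, audio_feats):
    """Align audio (teacher) features to the video (student) time axis with
    scaled dot-product attention, then penalise their distance.
    video_feats: (T_v, d); audio_feats: (T_a, d)."""
    scores = video_feats @ audio_feats.t() / video_feats.size(-1) ** 0.5  # (T_v, T_a)
    weights = F.softmax(scores, dim=-1)        # each video frame attends over audio frames
    aligned_audio = weights @ audio_feats      # (T_v, d): one audio summary per video frame
    return F.mse_loss(video_feats, aligned_audio)
```

Because the attention produces one aligned audio vector per video frame, the loss is well defined even though the video and audio sequences have different lengths.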
Keywords/Search Tags: Lip Reading, Knowledge Distillation, Automatic Speech Recognition, Deep Learning, Cross-modal