
Lipreading Based On Deep Learning

Posted on: 2021-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: K C Wu
Full Text: PDF
GTID: 2518306476950859
Subject: Electronics and Communications Engineering
Abstract/Summary:
The lipreading task is to recognize what a speaker says from the speaker's lip movements. The key to lipreading is how to effectively extract feature vectors that represent lip motion information. Deep neural networks update the weights of a large number of parameters through objective functions and backpropagation, automatically learning features relevant to the target task, and have achieved strong results on lipreading. However, because of the complexity of the task itself and the diversity of lip movements, lipreading still faces many difficulties and challenges.

To address these problems, this thesis proposes a word-level lipreading model based on a deep learning attention mechanism. The model achieves good results on recognizing target words with contextual information, with a Top-1 accuracy of 86% on the LRW dataset and 38.58% on the LRW-1000 dataset. In addition, lipreading requires voice endpoint detection, but endpoint detection based on audio data performs poorly under noisy conditions. A deep-learning-based visual model is therefore proposed that performs speech endpoint detection using lip motion information. The main work and innovations of this thesis are as follows:

(1) A word-level lipreading model based on a deep learning attention mechanism is proposed. The model uses spatio-temporal 3D convolution to extract spatio-temporal features from the lip image sequence, applies a channel attention mechanism to weight the image features so that effective features are enhanced and invalid features are suppressed, models the temporal relationships of the features with a Long Short-Term Memory (LSTM) network, and applies a temporal attention mechanism to weight the features at different time steps, learning the correlation between the features at different moments and the final recognition result. The superiority of the model is verified by comparison with the best existing lipreading models, and the effectiveness of the channel attention and temporal attention mechanisms is demonstrated through comparison experiments.

(2) A deep-learning-based visual model is proposed for the speech endpoint detection task. The model takes lip motion information as input and takes previous consecutive frames into account when detecting endpoints. First, a spatio-temporal 3D convolutional network extracts the spatial features of the lip region in the image sequence and the short-time features corresponding to lip motion. The temporal features are then further extracted by an LSTM network. Finally, each frame is classified as speech or non-speech by a fully connected layer. Experiments verify that under low signal-to-noise acoustic conditions, endpoint detection with the visual model outperforms a deep-learning speech model as well as a speech algorithm based on acoustic features and a machine learning classifier.
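For illustration, the following is a minimal PyTorch-style sketch of the word-level architecture described in (1): a 3D convolutional front-end, squeeze-and-excitation style channel attention, an LSTM, and temporal attention over the LSTM outputs. Layer sizes, kernel shapes, and the 500-word output vocabulary are assumptions made for the sketch, not the thesis's exact configuration.

# Minimal sketch of the described word-level lipreading architecture
# (layer sizes and hyperparameters are assumptions, not the author's exact setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention over per-frame feature maps."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):               # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))          # global average pooling -> (B, C)
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(w))))
        return x * w[:, :, None, None]  # enhance useful channels, suppress others

class LipreadingModel(nn.Module):
    def __init__(self, num_words=500, hidden=256):
        super().__init__()
        # Spatio-temporal 3D convolution front-end over the grayscale lip sequence.
        self.front3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        self.channel_att = ChannelAttention(64)
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        # Temporal attention: scores the LSTM output at every time step.
        self.temporal_att = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, num_words)

    def forward(self, x):                          # x: (B, 1, T, H, W)
        feat = self.front3d(x)                     # (B, C, T, H', W')
        B, C, T, H, W = feat.shape
        frames = feat.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
        frames = self.channel_att(frames)          # channel-wise reweighting
        frames = frames.mean(dim=(2, 3)).view(B, T, C)        # spatial pooling per frame
        seq, _ = self.lstm(frames)                 # (B, T, 2*hidden)
        alpha = torch.softmax(self.temporal_att(seq), dim=1)  # (B, T, 1) time-step weights
        context = (alpha * seq).sum(dim=1)         # attention-weighted sequence summary
        return self.classifier(context)            # word logits

# Usage sketch: a batch of 2 clips, 29 frames of 88x88 mouth crops.
logits = LipreadingModel()(torch.randn(2, 1, 29, 88, 88))
print(logits.shape)  # torch.Size([2, 500])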
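Likewise, a minimal sketch of the visual endpoint detection model described in (2): a short-window 3D convolution over the lip sequence, a unidirectional LSTM so that each frame's decision depends only on previous frames, and a fully connected layer producing a speech/non-speech label per frame. All sizes and the per-frame labeling scheme are assumptions for illustration.

# Minimal sketch of the described visual voice-activity (endpoint) detector
# (layer sizes and labeling scheme are assumptions).
import torch
import torch.nn as nn

class VisualVAD(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Short-window 3D convolution captures lip motion across neighboring frames.
        self.front3d = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),   # keep the time axis, pool space
        )
        # Unidirectional LSTM: each frame only sees previous consecutive frames.
        self.lstm = nn.LSTM(32 * 4 * 4, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)            # speech vs. non-speech per frame

    def forward(self, x):                         # x: (B, 1, T, H, W)
        feat = self.front3d(x)                    # (B, 32, T, 4, 4)
        B, C, T, H, W = feat.shape
        seq = feat.permute(0, 2, 1, 3, 4).reshape(B, T, C * H * W)
        out, _ = self.lstm(seq)                   # (B, T, hidden)
        return self.fc(out)                       # frame-wise logits (B, T, 2)

# Usage sketch: label every frame of a 75-frame clip as speech or silence.
frame_logits = VisualVAD()(torch.randn(1, 1, 75, 64, 64))
print(frame_logits.argmax(-1).shape)  # torch.Size([1, 75])

The unidirectional LSTM here reflects the abstract's statement that previous consecutive frames are considered for endpoint detection; a bidirectional variant would not be usable for online detection.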
Keywords/Search Tags:Lipreading, Attentional Mechanism, Long Short Term Memory Network, Speech Endpoint Detection