
Research On Audio-Video Information Processing Based On Lip-Changing

Posted on: 2021-07-09    Degree: Master    Type: Thesis
Country: China    Candidate: Y M Wang    Full Text: PDF
GTID: 2518306461958609    Subject: Master of Engineering
Abstract/Summary:
Speech recognition has long been a key research topic in human-computer interaction. In real scenes, however, it is exposed to various kinds of environmental noise, such as ambient unwanted voices and cross-talk between multiple speakers, which make recognition more challenging and degrade its performance. To address these shortcomings, visual lip information from a different modality is introduced into the speech recognition system: visual information is not disturbed by acoustic noise, and lip movements carry rich cues that help determine the spoken content. In this context, the author proposes an end-to-end audio-visual speech recognition model. The main research contents are as follows.

Firstly, audio-visual feature extraction and modal processing. A sparse deep belief network (DBN) with a bottleneck structure is proposed to extract audio-visual speech features. To avoid the curse of dimensionality and make the traditional DBN more robust to its input data, a sparse DBN is constructed by introducing the L1/2 norm and the L1 norm into the network's objective function in the manner of a non-overlapping group Lasso, achieving a sparse representation of the audio-visual features (an illustrative sketch of such a penalty is given after the abstract). In preparation for the later modal-level fusion, a Bidirectional Long Short-Term Memory (BLSTM) layer then performs modal processing on each feature stream.

Next, modal-level fusion of the audio-visual information. To resolve the timing inconsistency that arises when different modalities are fused, the author employs an attention mechanism that "matches" the BLSTM output of the audio stream against the BLSTM output of the visual stream at each time step. The resulting scores are linearly combined with the visual-stream BLSTM output, yielding, at the corresponding time, the visual-stream context vector associated with the current audio-stream BLSTM output. The two are then fused through a connection layer, so that automatic alignment and fusion are achieved (see the second sketch after the abstract). In this way, the information from the different modalities is represented by a higher-level audio-visual fusion sequence that is convenient for subsequent classification and recognition.

Finally, the audio-visual fusion information is classified and recognized. A Softmax layer on top of a BLSTM layer performs the multi-class prediction: the input audio-visual sequence is mapped to a probability distribution over the output categories, and the category with the highest probability is taken as the final predicted label. Experiments show that the proposed model can effectively recognize audio-visual information and achieves a higher recognition rate and better robustness than the selected baseline algorithms.
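The following is a minimal sketch of the kind of sparsity penalty referred to in the first step. It assumes a PyTorch implementation, treats each row of a layer's weight matrix as one non-overlapping group, and combines an element-wise L1 term with a group-wise L1/2-style term; the grouping, the exact norm formulation, and the coefficients lam_l1 and lam_l12 are illustrative assumptions, not the thesis's implementation.

```python
# Sketch: L1 + group-wise L1/2-style penalty added to a layer's weights,
# in the spirit of non-overlapping group Lasso regularization of a sparse DBN.
import torch


def sparse_penalty(weight: torch.Tensor, lam_l1: float = 1e-4, lam_l12: float = 1e-4) -> torch.Tensor:
    # weight: (hidden_units, visible_units); each row is treated as one non-overlapping group.
    l1_term = weight.abs().sum()                                      # element-wise L1 norm
    group_l12 = (weight.abs() + 1e-8).sqrt().sum(dim=1).pow(2).sum()  # L1/2 quasi-norm per group (row), summed
    return lam_l1 * l1_term + lam_l12 * group_l12


# During (pre-)training, the penalty is simply added to the reconstruction loss:
weight = torch.randn(256, 512, requires_grad=True)
reconstruction_loss = torch.zeros(())          # stand-in for the RBM/DBN reconstruction loss
total_loss = reconstruction_loss + sparse_penalty(weight)
total_loss.backward()
```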
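The second sketch illustrates the attention-based modal-level fusion described above: each audio-stream BLSTM output is matched against all visual-stream BLSTM outputs, the normalized scores are linearly combined with the visual outputs to form a per-time-step visual context vector, and the connection layer concatenates it with the audio state. The module name, the dot-product scoring, and all dimensions are assumptions for illustration only.

```python
# Sketch: attention-based alignment and fusion of audio- and visual-stream BLSTM outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualAttentionFusion(nn.Module):
    def __init__(self, audio_dim: int, visual_dim: int):
        super().__init__()
        # Project audio states into the visual space so a dot-product score is defined.
        self.query_proj = nn.Linear(audio_dim, visual_dim)

    def forward(self, audio_seq: torch.Tensor, visual_seq: torch.Tensor) -> torch.Tensor:
        # audio_seq:  (batch, T_a, audio_dim)  -- BLSTM outputs of the audio stream
        # visual_seq: (batch, T_v, visual_dim) -- BLSTM outputs of the visual stream
        queries = self.query_proj(audio_seq)                     # (batch, T_a, visual_dim)
        scores = torch.bmm(queries, visual_seq.transpose(1, 2))  # (batch, T_a, T_v)
        weights = F.softmax(scores, dim=-1)                      # match each audio step to the visual steps
        visual_context = torch.bmm(weights, visual_seq)          # (batch, T_a, visual_dim)
        # "Connection layer": concatenate each audio state with its aligned visual context.
        return torch.cat([audio_seq, visual_context], dim=-1)    # (batch, T_a, audio_dim + visual_dim)


# Example usage with toy tensors:
fusion = AudioVisualAttentionFusion(audio_dim=128, visual_dim=64)
fused = fusion(torch.randn(2, 50, 128), torch.randn(2, 25, 64))
print(fused.shape)  # torch.Size([2, 50, 192])
```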
Keywords/Search Tags: Audio-visual speech recognition, Sparse deep belief network, Bidirectional Long Short-Term Memory, Modal level fusion, Attention mechanism