
Research On Sign Language Recognition Method Based On Modal Fusion

Posted on: 2022-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: S Lu
Full Text: PDF
GTID: 2505306533472654
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the widespread application of artificial intelligence in human-computer interaction, using intelligent technology to help the deaf community communicate without barriers has become one of the hot research topics in artificial intelligence. Deep learning-based sign language recognition methods have developed rapidly in recent years and effectively help deaf people communicate with the outside world. However, this technology still suffers from highly redundant multi-source data and insufficient representation learning ability, and multi-modal data fusion is the key to solving the multi-source representation learning problem. With the development of multi-source imaging equipment, sign language data now takes various forms, and research on sign language recognition based on multi-modal fusion is emerging. Nevertheless, current multi-modal sign language recognition still faces many difficulties: sign language data is complex and highly redundant, the representation learning ability for specific modalities is insufficient, and fusing multi-modal data is hard. For continuous sign language sentence recognition in particular, video temporal segmentation, semantic alignment, and data labeling pose huge challenges.

In view of the above problems, this paper studies sign language recognition methods based on modal fusion. The main research contents and contributions are as follows:

(1) This paper builds a sign language data acquisition system and constructs a Chinese Daily Sign Language dataset (CDSL) with 7,000 samples in total. Each sample contains three modalities: color video, depth video, and skeleton-point video. The modalities are further expanded by adding optical-flow image data. In addition, a key frame detection model removes silent and redundant frames from the sign language videos to obtain key frame sequences (a minimal key-frame filter is sketched below, after point (2)).

(2) Aiming at the high redundancy and insufficient representation ability of multi-modal sign language data, a sign language recognition model based on modal fusion is proposed. The model takes the key frame sequences of the color and depth videos as input and uses two dilated three-dimensional convolutional neural networks to extract the spatiotemporal features of the two sequences. It then fuses these features and feeds the fused feature into a long short-term memory network for temporal modeling and classification, forming a two-stream sign language recognition network. In parallel, a spatiotemporal graph convolutional network extracts and classifies the key frame features of the skeleton-point video, and the classification scores of the two-stream network and the graph network are combined by way of decision fusion (see the two-stream sketch below).
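A minimal sketch of the key-frame filtering idea in point (1) follows. The abstract does not describe the actual key frame detection model, so this version simply drops silent (near-static) and redundant (near-duplicate) frames by thresholding the mean inter-frame pixel difference; the function name and the motion_thresh value are illustrative assumptions.

import numpy as np

def extract_key_frames(frames: np.ndarray, motion_thresh: float = 4.0) -> np.ndarray:
    """frames: (T, H, W, C) uint8 video; returns the key-frame subsequence."""
    keep = [0]  # always keep the first frame as the initial reference
    for t in range(1, len(frames)):
        # mean absolute pixel difference against the last kept key frame
        diff = np.abs(frames[t].astype(np.float32)
                      - frames[keep[-1]].astype(np.float32)).mean()
        if diff > motion_thresh:  # enough motion: keep as a new key frame
            keep.append(t)
    return frames[keep]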
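Next is a minimal PyTorch sketch of the two-stream model and decision fusion in point (2). Layer sizes, the dilation pattern, and the fusion weight alpha are illustrative assumptions rather than the thesis's exact configuration, and the spatiotemporal graph convolutional branch enters only through its output scores.

import torch
import torch.nn as nn

class Dilated3DBranch(nn.Module):
    """One modality branch: a small dilated 3D CNN feature extractor."""
    def __init__(self, in_ch: int, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time, pool space
        )

    def forward(self, x):  # x: (B, C, T, H, W)
        f = self.conv(x)   # (B, feat_dim, T, 1, 1)
        return f.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, feat_dim)

class TwoStreamSLR(nn.Module):
    """Dilated 3D CNNs on RGB and depth, feature fusion, LSTM classifier."""
    def __init__(self, num_classes: int, feat_dim: int = 256):
        super().__init__()
        self.rgb = Dilated3DBranch(3, feat_dim)
        self.depth = Dilated3DBranch(1, feat_dim)
        self.lstm = nn.LSTM(2 * feat_dim, 512, batch_first=True)
        self.cls = nn.Linear(512, num_classes)

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb(rgb), self.depth(depth)], dim=-1)
        h, _ = self.lstm(fused)    # temporal modeling over key frames
        return self.cls(h[:, -1])  # classify from the last time step

def decision_fusion(two_stream_logits, skeleton_logits, alpha: float = 0.5):
    """Score-level fusion with the ST-GCN skeleton branch (not shown)."""
    return (alpha * torch.softmax(two_stream_logits, -1)
            + (1 - alpha) * torch.softmax(skeleton_logits, -1))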
(3) Aiming at the difficulty of semantic alignment in continuous sign language sentence recognition, an attention-based continuous sign language sentence recognition model is proposed. The model uses dilated three-dimensional convolutional neural networks to extract shallow spatiotemporal features from the color video and the optical-flow image sequence, and feeds these features into a dual-modal encoding network composed of a bidirectional long short-term memory network and an attention network; finally, a decoding network based on connectionist temporal classification produces the target semantic sequence. The model effectively exploits the complementary representation ability of the bimodal data, uses the attention mechanism to capture the key information in high-level features, alleviates the independence assumption of connectionist temporal classification, and realizes end-to-end continuous sign language recognition (a sketch follows after point (4)).

(4) Aiming at the scarcity of data annotations in continuous sign language sentence recognition and the difficulty of making output sentences satisfy grammatical relations, a continuous sign language sentence recognition model based on modal matching is proposed. The model uses a time-adaptive convolutional neural network with few parameters to extract and fuse the spatiotemporal features of key frame clips from the color video and optical-flow image sequences; it then maps the word sequence and the fused spatiotemporal features into the same latent semantic space; finally, an encoder learns the long-term spatiotemporal features of the key frame clips, which are fed into a decoder together with the word feature vectors to realize matching and alignment between the sign language video and the word sequence (see the matching sketch below). In the experiments, comparisons under different evaluation criteria verify that the model not only improves accuracy but also produces output sentences more in line with daily communication habits.

All models proposed in this paper are tested on the self-collected Chinese Daily Sign Language dataset and compared with a variety of mainstream algorithms to verify their accuracy and feasibility.
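For point (3), here is a minimal PyTorch sketch of the dual-modal encoder with attention and CTC decoding, assuming the shallow 3D-CNN features of the color and optical-flow streams are precomputed as (B, T, D) tensors; the vocabulary size, feature dimensions, and attention head count are assumptions.

import torch
import torch.nn as nn

class AttnCTCRecognizer(nn.Module):
    def __init__(self, feat_dim: int, vocab_size: int, hidden: int = 256):
        super().__init__()
        # bimodal fusion of RGB and optical-flow features by concatenation
        self.encoder = nn.LSTM(2 * feat_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size + 1)  # +1: CTC blank

    def forward(self, rgb_feat, flow_feat):  # each: (B, T, feat_dim)
        x = torch.cat([rgb_feat, flow_feat], dim=-1)
        h, _ = self.encoder(x)     # BiLSTM temporal encoding
        h, _ = self.attn(h, h, h)  # self-attention eases CTC's frame independence
        return self.out(h).log_softmax(-1)  # (B, T, vocab+1)

# CTC training step: nn.CTCLoss expects log-probs shaped (T, B, vocab+1)
model = AttnCTCRecognizer(feat_dim=256, vocab_size=1000)
ctc = nn.CTCLoss(blank=1000, zero_infinity=True)
rgb, flow = torch.randn(2, 50, 256), torch.randn(2, 50, 256)
targets = torch.randint(0, 1000, (2, 8))  # gloss indices per sentence
logp = model(rgb, flow).transpose(0, 1)   # (T, B, vocab+1)
loss = ctc(logp, targets,
           input_lengths=torch.full((2,), 50, dtype=torch.long),
           target_lengths=torch.full((2,), 8, dtype=torch.long))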
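Finally, a sketch of the modal-matching idea in point (4): fused clip features and word embeddings are projected into one latent space, an encoder summarizes the long-term video features, and a decoder consumes the word vectors conditioned on that summary. The abstract does not give the matching objective, so scoring alignment by cosine similarity in the shared space is an assumption, as are all module sizes.

import torch
import torch.nn as nn

class ModalMatcher(nn.Module):
    def __init__(self, clip_dim: int, vocab_size: int, latent: int = 256):
        super().__init__()
        self.vid_proj = nn.Linear(clip_dim, latent)       # clips -> latent space
        self.word_emb = nn.Embedding(vocab_size, latent)  # words -> latent space
        self.encoder = nn.GRU(latent, latent, batch_first=True)
        self.decoder = nn.GRU(latent, latent, batch_first=True)

    def forward(self, clip_feats, word_ids):
        # clip_feats: (B, T, clip_dim) fused RGB + optical-flow clip features
        v = self.vid_proj(clip_feats)
        _, hv = self.encoder(v)      # long-term video summary, (1, B, latent)
        w = self.word_emb(word_ids)  # (B, S, latent) word feature vectors
        _, hw = self.decoder(w, hv)  # decode words conditioned on the video
        # alignment score in the shared space: higher = better video-text match
        return torch.cosine_similarity(hv.squeeze(0), hw.squeeze(0), dim=-1)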
Keywords/Search Tags: sign language recognition, modal fusion, convolutional neural network, recurrent neural network, attention mechanism