
Deep Learning Based Sign Language Recognition

Posted on: 2019-03-04  Degree: Doctor  Type: Dissertation
Country: China  Candidate: J Huang  Full Text: PDF
GTID: 1318330542497990  Subject: Information and Communication Engineering
Abstract/Summary:
The study of sign language recognition has two main aspects: isolated word recognition and continuous sentence recognition. Isolated word recognition focuses on a single gesture performed by the user and attempts to identify it. In continuous recognition, the user performs gestures one after another, and the goal is to recognize each gesture in turn. This dissertation studies both tasks and, with the help of deep learning methods that have attracted wide attention in recent years, achieves substantial progress on several large-scale public data sets.

Sign language recognition faces the following problems. 1) Sign language motions are mainly characterized by variations in hand shape, but rapid changes, large deformations, and occlusion make it difficult to design a discriminative representation of sign motions. 2) Sign language video sequences contain strong redundancy, such as spatial background, transitional frames, and static frames; this redundant information interferes with recognition and complicates the problem. 3) The ultimate purpose of sign language recognition is continuous sentence recognition, which depends on segmenting the sequence and recognizing isolated words; transitions between signs have no obvious characteristics, so accurate segmentation is difficult to achieve.

To address the difficulty of designing discriminative sign language representations described in problem 1), we propose an isolated word recognition method based on a three-dimensional convolutional neural network. The powerful feature learning ability of the deep convolutional neural network allows us to skip the detection, tracking, and segmentation of the hand region and avoids hand-crafted feature design. The three-dimensional convolutional neural network takes the raw video segment as input and learns the temporal and spatial characteristics of the motion. Because the network requires a fixed input size, the video stream is cut into clips with a sliding window and fed to the network to extract features. The vector obtained by aggregating the feature sequence serves as the representation of the video, and classification is performed on this representation with an SVM. To further increase recognition accuracy, we use RGB-D data and exploit the complementarity between the two modalities.

To remove the redundant information described in problem 2), we propose an isolated word recognition method based on the attention mechanism. Spatially, sign language actions are performed mainly in the arm and palm regions; other regions are irrelevant background whose redundant information causes interference, so we keep only the pixel information of the target area. In addition, the importance of information differs across time steps, so we use a recurrent neural network to aggregate the feature sequence via attention pooling. Specifically, since sign language actions mainly involve pixels in the palm and arm regions, we use this prior to filter each video frame: simulating the mechanism of the human visual system, we highlight the pixels of the target area and darken the background and irrelevant areas. Spatio-temporal features are then learned and extracted with a convolutional neural network, so that each video is represented by a feature sequence, which is encoded with a recurrent neural network to obtain the video representation. The recurrent neural network integrates the attention mechanism and assigns different weights to the feature vectors at different time steps, so redundant information receives very low scores. To further increase recognition accuracy, trajectory features based on the shape context are extracted from the joint coordinate positions in addition to the RGB-D data. The video features and trajectory features are fused and connected to a softmax layer for classification.
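A minimal sketch of the attention pooling idea described above, assuming PyTorch; the layer names, feature dimension, hidden size, and class count are illustrative placeholders, not the dissertation's actual configuration.

# Attention pooling over a sequence of per-clip features (illustrative sketch).
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=512, num_classes=500):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)        # one attention score per time step
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats):                        # feats: (batch, T, feat_dim)
        h, _ = self.rnn(feats)                       # h: (batch, T, hidden_dim)
        alpha = torch.softmax(self.score(h), dim=1)  # weights sum to 1 over time; redundant steps get low weight
        video_repr = (alpha * h).sum(dim=1)          # weighted sum -> video representation
        return self.classifier(video_repr)

# Example: 10 clip features extracted by a 3D CNN from one video batch of size 2
logits = AttentionPooling()(torch.randn(2, 10, 4096))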
To remove redundant information, we further propose an isolated word recognition method based on key segment selection and a hierarchical attention network. Unlike the previous approach, which integrates the information of all frames into the video representation, this method first selects key segments to remove redundant information and then builds a two-level attention network to model the filtered sequence, taking into account the structure and ambiguity of the sign sequence. The purpose of key segment selection is to remove redundancy, mainly static frames and transition frames. The goal of the hierarchical attention network is to learn, from the key segments, a representation of the video for classification. The network uses a two-layer structure to learn the weights of the sequence. The first layer is a short-term attention module based on the convolutional neural network; it independently learns the weights of the video frames within each clip and generates the clip representation. The second layer takes the sequence of clip representations as input, measures the importance of each clip, and generates the video representation for classification. The entire model has two optimization goals, key clip selection and recognition; the two goals depend on each other, so they are optimized alternately with an Expectation Maximization-style algorithm and promote each other.

To circumvent the difficulty of temporal segmentation described in problem 3), we propose a continuous sentence recognition method based on a latent space and recurrent neural networks. To improve performance, we first redesign the sign language video representation: a two-stream three-dimensional convolutional neural network learns local hand-shape features and global trajectory features simultaneously and aggregates them as video clip features. To circumvent the segmentation step, we use a recurrent neural network to implement a sequence-to-sequence mapping, encoding the input video sequence as a hidden state vector and then decoding it into the target word sequence. However, this process only learns the mapping between video and text and ignores the correlation between the two modalities, so we simultaneously learn a latent space during recognition to bridge the semantic gap between the two kinds of data.
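A minimal sketch of the sequence-to-sequence idea for continuous sentence recognition, again assuming PyTorch: an RNN encoder compresses the clip-feature sequence into a hidden state, and an RNN decoder emits the word sequence. The vocabulary size, dimensions, token identifiers, and greedy decoding loop are illustrative assumptions, and the latent-space alignment term described above is omitted for brevity.

# Encoder-decoder mapping from clip features to a word sequence (illustrative sketch).
import torch
import torch.nn as nn

class Seq2SeqSLR(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=512, vocab_size=300):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, clip_feats, max_len=12, bos_id=0):
        # clip_feats: (batch, T, feat_dim), e.g. from a two-stream 3D CNN
        _, state = self.encoder(clip_feats)           # state: (1, batch, hidden_dim)
        token = torch.full((clip_feats.size(0), 1), bos_id, dtype=torch.long)
        outputs = []
        for _ in range(max_len):                      # greedy decoding, step by step
            step, state = self.decoder(self.embed(token), state)
            logits = self.out(step)                   # (batch, 1, vocab_size)
            token = logits.argmax(dim=-1)             # feed the predicted word back in
            outputs.append(logits)
        return torch.cat(outputs, dim=1)              # (batch, max_len, vocab_size)

# Example: decode a word sequence from 20 clip features
words = Seq2SeqSLR()(torch.randn(2, 20, 1024)).argmax(dim=-1)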
Keywords/Search Tags: Sign language recognition, 3D convolutional neural network, Recurrent neural network, Attention mechanism, Latent space