Sign language is a vision-based language that provides a medium of communication for deaf people. Research on sign language translation and generation is closely related to many cutting-edge research fields, including computer vision, natural language processing, cross-media computing, and human-computer interaction. Early researchers devoted themselves to isolated sign language recognition, a line of work dating back more than 30 years. Recently, owing to advances in deep learning, an increasing number of works have focused on continuous sign language translation and sign language video generation. Although some of the latest techniques have achieved widespread success, research on sign language translation and generation still faces many difficulties and challenges. (1) Sign language translation is a typical weakly supervised sequence learning task. Since sign language datasets usually contain only sentence-level annotations and lack the precise temporal location of each sign action, translation and generation must inevitably cope with insufficient supervision. (2) The temporal semantic representation of sign language videos is complex. Sign language combines multiple types of visual cues, including facial micro-motions, local finger movements, and body movements, so sign language videos must be understood at different temporal scales and spatial granularities. (3) Sign language translation involves challenges in blended semantic learning. How to jointly represent multimodal cues in a unified framework and achieve semantic alignment of cross-modal data remains intractable. (4) Sign language generation faces the weak-semantics problem of input text. Compared with the complex video output, the semantics of the input sign sentence is weak, and how to strengthen the guidance from textual semantics during sign language video generation is an urgent problem.

To address the above problems and challenges, this dissertation proposes a series of methods for sign language translation and generation. The main contributions are summarized as follows:

Pseudo-supervised Sign Language Translation Based on Online Joint Optimization. A connectionist temporal modeling method with joint loss optimization is proposed to handle weak supervision and temporal semantic representation in sign language translation. First, the method designs a dual-stream short-term temporal learning stage, comprising a temporal convolution pyramid module for (2D+1D) pseudo-3D convolutional feature learning and a 3D convolution module for short-range spatiotemporal joint modeling. Then, a dynamic decoding scheme for long-term temporal learning is implemented with a bidirectional recurrent neural network and a connectionist temporal classification (CTC) network. This scheme directly learns the temporal mapping among visual features, sign labels, and generated sentences, and the resulting label alignment is treated as pseudo-labels. Finally, using the pseudo-supervised cues obtained above, a joint loss optimization simultaneously measures feature correlation, entropy regularization of sign labeling, and probability maximization of sentence decoding in an end-to-end framework. The method achieves better performance than other online translation models without introducing additional supervision.

Sign Language Translation Based on Multimodal Sequential Graph Embedding. Sign language translation often involves input signals from multiple sources, so it is necessary to learn cross-modal correlations between multimodal features while also exploring temporal correlations within each modality. To this end, a multimodal sequential feature embedding method based on graph neural networks is proposed. Specifically, a graph structure is constructed to enable inter-modal correlation learning and intra-modal temporal exploration. First, the method designs a graph embedding unit that embeds parallel convolutions with channel and temporal learning into a graph convolutional network, learning temporal cues within each modality sequence as well as cross-modal complementarity. Then, a hierarchical stack of graph embedding units with pooling skip connections is proposed. To obtain a compact and informative representation of multimodal sequences, this hierarchical stack gradually compresses the channel dimension instead of the temporal dimension, thus preserving more temporal cues. Finally, a connectionist temporal decoding strategy is employed to explore temporal correlations across the entire video stream and translate feature sequences into complete sign sentences. This method addresses short-term temporal cue mining and multimodal complementarity learning simultaneously in a unified model.

Sign Language Pose Generation Based on Textual Semantic Enhancement. To address the issue that textual semantics are much weaker than the visual context in sign language generation, a gloss semantic-enhanced network with online back-translation is proposed. Unlike existing methods that focus only on regression of pose coordinates (i.e., fitting the ground-truth coordinate values as closely as possible), the proposed method emphasizes enhanced textual semantic guidance and cross-modal semantic consistency constraints. Specifically, the network consists of a sign gloss encoder, a pose decoder, and an online reverse text decoder. First, in the transformer-based sign gloss encoder, a learnable gloss token is introduced to explore the global context dependencies of the entire sign gloss sequence without any sign priors. Then, a progressive recurrent decoding model for sign poses is designed. During pose decoding, sign gloss tokens are aggregated onto the generated pose sequences as semantic guidance; the aggregated features then interact with the entire sign-word embedding vectors to generate the pose for the next moment. Finally, a reverse sign gloss decoder is designed, which back-translates the generated poses into a gloss sequence and aligns it with the original sentence. The model thus ensures semantic consistency in the bidirectional gloss-to-pose and pose-to-gloss conversion during sign language generation.
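The connectionist temporal classification step that the first two methods rely on can be illustrated with a minimal best-path decoding sketch. This is a deliberate simplification of the general CTC technique, not the dissertation's actual decoder; the function and variable names are hypothetical. It shows how per-frame predictions (the frame-level pseudo-labels discussed above) are collapsed into a sign sentence: take the argmax label at each frame, merge consecutive repeats, then drop blanks.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Best-path CTC decoding (illustrative sketch).

    frame_scores: one list of C class scores (e.g. log-probabilities)
    per video frame. Returns the collapsed label sequence.
    """
    # Frame-level pseudo-labels: the highest-scoring class at each time step.
    path = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]

    decoded, prev = [], None
    for label in path:
        # Merge consecutive repeats and skip the blank symbol.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Six frames over three classes (class 0 is the blank).
frames = [[0.1, 0.8, 0.1],   # argmax 1
          [0.1, 0.7, 0.2],   # argmax 1 (repeat, merged)
          [0.9, 0.05, 0.05], # argmax 0 (blank, dropped)
          [0.1, 0.1, 0.8],   # argmax 2
          [0.2, 0.1, 0.7],   # argmax 2 (repeat, merged)
          [0.1, 0.6, 0.3]]   # argmax 1
print(ctc_greedy_decode(frames))  # -> [1, 2, 1]
```

Beam search over the full label distribution would normally replace this greedy pass; the sketch only conveys why a blank symbol lets CTC learn the temporal mapping between unsegmented frames and sentence-level labels without frame-wise supervision.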