Sign Language (SL), as a special visual natural language, relies on multi-channel information such as manual and non-manual features to convey linguistic information. In recent years, Sign Language Translation (SLT), an important application for bridging the communication gap between the deaf and the hearing, has attracted widespread academic attention, and SLT based on the neural machine translation framework is an emerging research direction driven by advances in artificial intelligence. We found that, within existing research frameworks, it is difficult to deeply mine the implicit linguistic features of sign language as a special natural language in a weakly supervised manner. To this end, we propose improvements from two perspectives: semantic heuristics and visual heuristics. From the perspective of semantic heuristics, we believe that introducing additional word-level semantic knowledge from sign language linguistics can help improve sign language translation. However, this idea requires modeling solutions to problems such as sign language segmentation, multi-modal fusion, and sequence alignment. Hence, we propose a knowledge-based multi-modal feature fusion encoder for a dynamic-graph sign language translation model. To the best of our knowledge, this is the first time the concept of graph neural networks has been introduced into neural sign language translation. In the graph neural sign language translation model, we design a novel multi-modal graph embedding module that quantifies sign language visual features and sign gloss features, so that the multi-modal encoder can fuse the graph network with multi-modal features. From the perspective of visual heuristics, we found that the input video sequence contains a large amount of redundant information. This redundancy generally appears as similar frames in the temporal neighborhood, especially in longer sentences. Redundant frames not only occupy space, consume memory, and introduce considerable noise, but also increase the complexity of the graph neural network. To this end, we introduce a Frame Stream Density Compression algorithm, which effectively reduces the redundancy of input frames and the number of invalid graph nodes in an unsupervised manner, increasing the density of the effective information flow. This method is also instructive for other low-resource data processing tasks. We conducted experiments on RWTH-PHOENIX-Weather 2014T, a publicly available and widely used sign language translation dataset, to verify the proposed methods. Experiments show that our optimized models outperform the state-of-the-art baseline model.
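The abstract does not specify how the Frame Stream Density Compression algorithm works internally; as a rough illustration of the underlying idea only (dropping near-duplicate neighboring frames without supervision), here is a minimal sketch. The representation of frames as plain feature vectors, the cosine-similarity criterion, and the `threshold` parameter are all assumptions for illustration, not the paper's actual method:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two frame feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def compress_frame_stream(frames, threshold=0.95):
    # Keep a frame only if it is sufficiently dissimilar from the
    # last kept frame; near-duplicates in the temporal neighborhood
    # are discarded, so no frame-level labels are needed.
    if not frames:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        if cosine_similarity(kept[-1], frame) < threshold:
            kept.append(frame)
    return kept

# Two nearly identical opening frames collapse into one,
# while the clearly different third frame is retained.
stream = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
compressed = compress_frame_stream(stream)  # → [[1.0, 0.0], [0.0, 1.0]]
```

A greedy pass like this keeps the earliest frame of each run of similar frames, shortening the sequence fed to the graph network; the similarity measure and threshold would in practice be tuned to the visual features used.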