
Research on a Deep Model for Video Caption Generation Based on a Video Temporal Attention Hierarchical Fusion Mechanism

Posted on: 2022-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Zhou
Full Text: PDF
GTID: 2518306743973999
Subject: Cyberspace security
Abstract/Summary:
Video caption generation is a video content understanding task: given a video clip, a computer automatically generates a natural language caption describing the dynamic information in the video scene or summarizing the video's content. With the development of modern information technology and the advent of the 5G era, video data is growing explosively, and the Internet may now be filled with large amounts of illegal and non-compliant video content. At the same time, there are more than ten million hearing-impaired people in China who communicate through sign language, and it is extremely difficult for hearing people to understand the sign language movements of deaf people. It is therefore especially important to combine computers with artificial intelligence technology to automatically understand the content of videos and the expressions of signers. This technology can not only save the labor and time cost of video review and filtering at major video sites, but also facilitate communication between deaf and hearing people. In this paper, we combine techniques from computer vision and natural language processing to develop a deep model for video caption generation. The specific research work and innovations are as follows.

1) A novel deep model for video caption generation is proposed. The spatial information of the video is extracted by a spatial embedding module; a bidirectional gated recurrent module and a deep residual stacked gated recurrent layer further encode the spatio-temporal features of the video; and a decoder effectively generates caption text from these spatio-temporal features. For the sign language video content understanding task, the model is combined with a Transformer language model to perform continuous sign language translation, converting recognized gloss sequences into natural language descriptions.

2) Further, in order to better measure the relationship between video visual features and caption text semantics and to close the modal gap between them, a video temporal attention hierarchical fusion mechanism is proposed. Based on this mechanism, a deep video caption generation model that fuses low-level spatially embedded features carrying content-associated semantics with high-level abstract features can generate the final caption text more accurately (see the sketch after this abstract).

3) The above deep model for video caption generation is evaluated in a series of comparative experiments on Tslrt, the first large continuous Chinese sign language translation and recognition video dataset with complex everyday contexts, and on Audit V, a self-built video audit dataset. To further validate the effectiveness of the proposed model, experiments are conducted on another large continuous Chinese sign language recognition dataset, Chinese-CSL, and compared with other published methods; the results show that the proposed method achieves the best results.
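The abstract does not give implementation details, so the following is a minimal PyTorch sketch of what an encoder of this kind (spatial embedding, bidirectional gated recurrent module, deep residual stacked gated recurrent layer, and temporal attention fusing low-level and high-level features) might look like. All module names, dimensions (feat_dim=2048, hidden=512), and the attention scoring functions are illustrative assumptions, not the thesis's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualStackedGRU(nn.Module):
    """Stack of GRU layers with a residual connection around each layer
    (an assumed reading of 'deep residual stacked gated recurrent layer')."""
    def __init__(self, hidden, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.GRU(hidden, hidden, batch_first=True) for _ in range(num_layers)
        )

    def forward(self, x):
        for gru in self.layers:
            out, _ = gru(x)
            x = x + out  # residual connection across each stacked layer
        return x

class HierarchicalFusionEncoder(nn.Module):
    """Encoder sketch: spatial embedding -> BiGRU -> residual stacked GRU,
    with temporal attention fusing low- and high-level features."""
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.spatial_embed = nn.Linear(feat_dim, hidden)  # spatial embedding module
        self.bigru = nn.GRU(hidden, hidden // 2, batch_first=True,
                            bidirectional=True)           # bidirectional gated recurrent module
        self.stacked = ResidualStackedGRU(hidden, num_layers=2)
        # separate temporal attention heads for each feature level (assumption)
        self.attn_low = nn.Linear(hidden, 1)
        self.attn_high = nn.Linear(hidden, 1)
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, frame_feats):                       # (B, T, feat_dim) CNN frame features
        low = torch.tanh(self.spatial_embed(frame_feats)) # low-level, content-associated features
        mid, _ = self.bigru(low)
        high = self.stacked(mid)                          # high-level abstract features

        def attend(seq, scorer):
            w = F.softmax(scorer(seq), dim=1)             # temporal attention weights (B, T, 1)
            return (w * seq).sum(dim=1)                   # weighted sum over time (B, hidden)

        ctx_low = attend(low, self.attn_low)
        ctx_high = attend(high, self.attn_high)
        fused = torch.tanh(self.fuse(torch.cat([ctx_low, ctx_high], dim=-1)))
        return fused, high  # fused context plus per-step features for the decoder
```

Under these assumptions, a decoder (for example a GRU or Transformer decoder, as the abstract mentions for the gloss-to-text translation step) would condition on `fused` and attend over `high` to emit caption tokens step by step.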
Keywords/Search Tags:Video content understanding, Deep learning, Attention mechanism, Caption generation