
Construction Of Multimodal Dataset With Sign Language And Research On Video Captioning Method

Posted on: 2022-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Li
Full Text: PDF
GTID: 2518306740483074
Subject: Computer technology
Abstract/Summary:
Video captioning draws on techniques from computer vision (CV) and natural language processing (NLP) to automatically generate natural-language descriptions of the content of an input video. Because video data pervades daily life, video captioning has a wide range of application scenarios, such as human-computer interaction, video retrieval, video surveillance, intelligent navigation, and visual information assistance. In recent years, video captioning has attracted growing attention, and the design and optimization of networks and the collection of datasets have become challenging and central research topics in artificial intelligence. Among existing approaches, multimodal video captioning has shown excellent performance: by supplementing the video with other sources of information, it can extract content features more comprehensively and thereby improve captioning accuracy. However, current multimodal methods are limited to audio and text as auxiliary modalities. We observe that sign language, as one of the information modalities of daily life, is an important means of communication and interaction for deaf and hard-of-hearing people, and is therefore a promising direction for extending multimodal video captioning research. Based on this idea, this thesis initiates the study of the sign language assisted video captioning (SLAVC) problem and, for the first time, explores the influence of sign language on video captioning performance. The main contributions are as follows:

(1) A new multimodal video captioning dataset containing sign language was constructed. We collected two years of program videos from "Chinese Focus On" (CFO) and cropped and extracted the sign language, visual, and audio modalities. At the same time, a video captioning tool was developed to help annotators describe and classify video content in a standardized way, and an online video captioning competition was organized using the constructed CFO dataset. The competition results and experiments with classic models show that the CFO dataset can be effectively applied to video captioning research.

(2) Based on the CFO dataset, a sign language assisted video captioning network (SLAVCNet) is proposed. The network extracts features from the multimodal information, including sign language, and uses attention-based GRU modules to encode and decode the extracted features; a global feature reconstructor is also introduced to strengthen the network's learning ability. Experiments with SLAVCNet on the CFO dataset show that sign language assistance effectively improves multimodal video captioning, and that the reconstructor module enhances the network's ability to learn the correspondence between the multimodal information and the description text.
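The abstract does not give SLAVCNet's exact equations, but the attention mechanism it relies on can be illustrated with a minimal sketch. The snippet below shows additive (Bahdanau-style) attention over a sequence of multimodal feature vectors, producing the context vector a GRU decoder would consume at each step; all names, dimensions, and weight matrices here are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(features, hidden, W_f, W_h, v):
    # Score each feature vector against the decoder hidden state
    # (additive attention), then form a weighted-sum context vector.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # shape (T,)
    alpha = softmax(scores)                              # attention weights
    return alpha @ features, alpha                       # context (D,), weights (T,)

rng = np.random.default_rng(0)
T, D, H = 6, 8, 4                    # time steps, feature dim, hidden dim
feats = rng.normal(size=(T, D))      # e.g. concatenated visual + sign features
hidden = rng.normal(size=H)          # current decoder hidden state
W_f = rng.normal(size=(D, H))        # feature projection (hypothetical)
W_h = rng.normal(size=(H, H))        # hidden-state projection (hypothetical)
v = rng.normal(size=H)               # scoring vector (hypothetical)

ctx, alpha = attention_context(feats, hidden, W_f, W_h, v)
```

At each decoding step the weights `alpha` (which sum to 1) indicate which multimodal time steps the decoder attends to, and `ctx` is fed into the GRU cell alongside the previous word embedding; in a full model the projections would be learned rather than random.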
Keywords/Search Tags:Multimodal video captioning, Sign language, Dataset, Attention mechanism, Reconstructor