
Cross-modal Semantic Alignment For Video Moment Retrieval

Posted on: 2022-11-11    Degree: Master    Type: Thesis
Country: China    Candidate: G M Wang    Full Text: PDF
GTID: 2518306764976279    Subject: Computer Software and Application of Computer
Abstract/Summary:
With the development of the Internet and video technology, video content is enjoyed by more and more people. Every day, vast numbers of videos around the world are shot, edited, and uploaded to the Internet. As the amount of video content grows exponentially, researchers are increasingly focusing on video retrieval techniques to process this massive volume of information. As videos become longer, we often want to retrieve the clip that best corresponds to a given text from a long video, which gives rise to the task of video moment retrieval: given a text query and a long video, find the segment that best matches the query and return its start and end times. This task also benefits other video tasks, such as video question answering, video description generation, and video localization.

Mainstream video moment retrieval methods generally consist of three stages: multi-modal feature extraction, cross-modal fusion, and video moment localization. In the feature extraction stage, video features and text features are extracted separately. The features of the two modalities are then fused, and the fused features are fed into a moment localization network to produce the final retrieval result. Although this pipeline has proven effective, there is still considerable room for improvement: cross-modal fusion is often insufficient, multiple actions within the same video clip interfere with one another, and the video representation is too coarse. To address these problems, the thesis proposes two methods to improve the performance of video moment retrieval.

To deal with insufficient fusion of features from different modalities and the interference of multiple actions in a video, the thesis proposes the Cross-modal Dynamic Network for Video Moment Retrieval with Text Query (CDN). Based on the text and video features, this method dynamically generates the kernels of a convolutional network and uses them to guide the convolution of the cross-modal features. It also employs a novel frame selection module to capture the features of different actions within the same video clip, reducing the mutual interference caused by those actions. At inference time, these two mechanisms add little computational cost while effectively improving retrieval performance.

To address the coarse video representation, the thesis further proposes the Language-enhanced Object Reasoning Network for Video Moment Retrieval with Text Query (LEORN). Instead of traditional video features, this method combines object-level visual features with semantic information and reasons about the relationships between objects to understand the video content. In addition, it uses a new temporal shift mechanism to avoid the interference caused by misaligned objects across frames.

Experiments on two challenging datasets, Charades-STA and TACoS, show that both proposed methods achieve competitive performance on multiple metrics compared with existing methods.
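The core mechanism described for CDN, generating convolution kernels from the query and using them to guide the convolution of cross-modal features, can be illustrated with a small sketch. The code below is a minimal, hypothetical PyTorch illustration and not the thesis implementation: the module name, feature dimensions, depthwise-kernel design, and the assumption that the fused video-text sequence and a pooled sentence feature are already available are all choices made for this example.

```python
# Minimal sketch (assumptions throughout): a depthwise 1-D convolution whose
# kernel is predicted from the pooled text-query feature and applied to the
# fused video-text sequence. Not the thesis's actual CDN code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicCrossModalConv(nn.Module):
    """Query-conditioned dynamic convolution over fused video-text features."""

    def __init__(self, dim: int = 512, kernel_size: int = 3):
        super().__init__()
        self.dim = dim
        self.kernel_size = kernel_size
        # Predict a depthwise 1-D kernel (dim * kernel_size weights) from the query feature.
        self.kernel_head = nn.Linear(dim, dim * kernel_size)

    def forward(self, fused: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # fused: (B, T, dim) fused clip features; query: (B, dim) pooled sentence feature.
        b, t, d = fused.shape
        # One depthwise kernel per sample and channel, conditioned on the text query.
        kernels = self.kernel_head(query).view(b * d, 1, self.kernel_size)
        # Grouped conv1d lets each (sample, channel) pair use its own generated kernel.
        x = fused.transpose(1, 2).reshape(1, b * d, t)
        out = F.conv1d(x, kernels, padding=self.kernel_size // 2, groups=b * d)
        return out.view(b, d, t).transpose(1, 2)


if __name__ == "__main__":
    fused_feats = torch.randn(2, 64, 512)   # 64 fused clip features per video
    query_feat = torch.randn(2, 512)        # pooled text-query feature
    layer = DynamicCrossModalConv()
    print(layer(fused_feats, query_feat).shape)  # torch.Size([2, 64, 512])
```

The grouped convolution here is only one way to realize per-query kernels; the point of the sketch is that the temporal filtering applied to the fused features depends on the text query rather than on fixed, query-independent weights.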
Keywords/Search Tags: Video Moment Retrieval, Video Understanding, Cross-modal Alignment, Moment Localization