
Multimodal Processing Technology For Video Analysis

Posted on: 2020-09-22  Degree: Doctor  Type: Dissertation
Country: China  Candidate: M Liu  Full Text: PDF
GTID: 1368330575956838  Subject: Computer Science and Technology

Abstract/Summary:
Nowadays, the ubiquity of cameras and social networks has greatly increased the amount of visual media content generated and shared by people, especially long videos and micro-videos. Since the number of these videos is growing exponentially, long-video and micro-video analysis have attracted wide attention from both industry and academia. In long-video analysis, the development of deep learning has shifted research from purely visual understanding to joint visual-language understanding (e.g., cross-modal moment localization in video); solving this problem requires a visual and language comprehension model that combines multi-modal interaction information. In micro-video analysis, a micro-video is a unity of multiple modalities, including social attributes, text descriptions, audio, and visual content, so it is crucial to extract features from these modalities effectively and use them to represent the micro-video.

This dissertation focuses on multi-modal learning methods for tackling these problems in video analysis. We propose a multi-modal dictionary learning method and a multi-modal sequence modeling method to represent micro-videos, and apply them to estimate the venue categories of micro-videos. In addition, a memory-attention-based and a language-temporal-attention-based cross-modal retrieval model are proposed to localize moments in long videos via natural language. The main contributions of this dissertation are as follows:

(1) Tree-guided multi-modal dictionary learning method. This dissertation presents a tree-structure-guided multi-modal dictionary learning model for micro-video venue category estimation, which co-regularizes hierarchical smoothness and structural consistency within a unified framework. It also introduces an online algorithm to support the model: if an incoming sample is labeled, we leverage it to strengthen the dictionary learning; otherwise, we compute its sparse representation over the current dictionaries and classify it into the appropriate venue category (a minimal sparse-coding sketch follows this list). To the best of our knowledge, this is the first study on learning sparse representations of micro-videos.

(2) Deep multi-modal sequence modeling method. This dissertation analyzes the parallel sequential structures and sparse properties of micro-videos and accordingly builds an end-to-end deep model for micro-video understanding, which jointly captures the sequential structures of the three modalities and the sparsity of micro-videos (a sketch of the parallel encoders follows this list). We apply the proposed model to a real-world micro-video application, i.e., venue category estimation; the experimental results show that it outperforms several state-of-the-art baselines.

(3) Temporal memory and tensor fusion based cross-modal retrieval method. This dissertation presents a novel attentive cross-modal moment retrieval network, which jointly characterizes the attentive contextual visual feature and the cross-modal feature representation. To localize moments in a video accurately via natural language, a temporal memory attention network is introduced to memorize the contextual information of each moment; the natural language query is fed into the attention network to adaptively assign weights to the memory representation (a sketch of the memory attention follows this list).

(4) Language-temporal attention based cross-modal retrieval network. We present a novel cross-modal temporal moment retrieval approach that adaptively encodes complex and significant language query information for localizing the desired moments. We propose a language-temporal attention network that jointly encodes the textual query, the local moment, and its context to comprehend the query, so that the key clues for localizing the desired moment can be extracted (a sketch of this attention follows this list).
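To make the online step of contribution (1) concrete, here is a minimal sparse-coding sketch in NumPy: an unlabeled sample is coded over the current (assumed already learned) dictionary with ISTA and then classified. The dictionary D, the classifier W, the sparsity weight, and all dimensions are illustrative assumptions, not the dissertation's actual components.

```python
# Minimal sketch (not the dissertation's exact algorithm): sparse coding of an
# unlabeled micro-video feature over a learned dictionary via ISTA, followed by
# classification with a hypothetical linear classifier W.
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(x, D, lam=0.1, n_iter=100):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)             # gradient of the smooth part
        a = soft_threshold(a - step * grad, step * lam)
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((128, 256))          # multi-modal dictionary (assumed learned)
D /= np.linalg.norm(D, axis=0)               # unit-norm atoms
W = rng.standard_normal((10, 256))           # hypothetical venue classifier, 10 categories
x = rng.standard_normal(128)                 # incoming multi-modal feature
a = sparse_code(x, D)                        # sparse representation of the sample
venue = int(np.argmax(W @ a))                # predicted venue category
```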
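For contribution (2), the following PyTorch sketch shows one plausible reading of the parallel sequential structure: one LSTM per modality, with the final hidden states fused for venue classification. The dimensions, the number of venue categories, and the plain concatenation fusion are assumptions; the dissertation's model additionally enforces a sparsity property that is omitted here.

```python
# Minimal PyTorch sketch of parallel per-modality sequence encoders for
# micro-video venue category estimation. All sizes are illustrative.
import torch
import torch.nn as nn

class MultiModalSequenceModel(nn.Module):
    def __init__(self, dims=(2048, 128, 300), hidden=256, n_venues=188):
        super().__init__()
        # one LSTM per modality: visual, acoustic, textual
        self.encoders = nn.ModuleList(
            nn.LSTM(d, hidden, batch_first=True) for d in dims
        )
        self.classifier = nn.Linear(3 * hidden, n_venues)

    def forward(self, visual, acoustic, textual):
        # each input: (batch, time, feature_dim); keep each final hidden state
        states = []
        for enc, seq in zip(self.encoders, (visual, acoustic, textual)):
            _, (h, _) = enc(seq)
            states.append(h[-1])               # (batch, hidden)
        fused = torch.cat(states, dim=-1)      # joint micro-video representation
        return self.classifier(fused)          # venue-category logits

model = MultiModalSequenceModel()
logits = model(torch.randn(4, 30, 2048),       # visual frame features
               torch.randn(4, 30, 128),        # acoustic features
               torch.randn(4, 20, 300))        # word embeddings
```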
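For contribution (3), this sketch illustrates query-conditioned attention over a bank of per-moment contextual "memories", followed by an outer-product ("tensor") fusion of the attended memory and the query. The layer sizes, the additive scoring function, and the fusion step are assumptions made for illustration.

```python
# Minimal PyTorch sketch: a sentence query attends over moment memories, and
# the attended context is tensor-fused with the query via an outer product.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalMemoryAttention(nn.Module):
    def __init__(self, mem_dim=256, query_dim=256, att_dim=128):
        super().__init__()
        self.proj_m = nn.Linear(mem_dim, att_dim)
        self.proj_q = nn.Linear(query_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, memories, query):
        # memories: (batch, n_moments, mem_dim); query: (batch, query_dim)
        e = torch.tanh(self.proj_m(memories) + self.proj_q(query).unsqueeze(1))
        alpha = F.softmax(self.score(e).squeeze(-1), dim=-1)    # moment weights
        attended = (alpha.unsqueeze(-1) * memories).sum(dim=1)  # weighted memory
        return attended, alpha

att = TemporalMemoryAttention()
query = torch.randn(2, 256)                       # encoded sentence query
ctx, weights = att(torch.randn(2, 50, 256), query)
# cross-modal tensor fusion: outer product of attended context and query,
# flattened into a joint feature for scoring each candidate moment
fused = torch.einsum('bi,bj->bij', ctx, query).flatten(start_dim=1)
```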
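For contribution (4), this sketch shows a language-temporal attention module in the same spirit: the candidate moment and its temporal context weight the words of the query, so the clues most relevant to that moment dominate the query encoding. All sizes and layer choices are illustrative assumptions.

```python
# Minimal PyTorch sketch of language-temporal attention: moment + context
# features adaptively re-weight the query words into a moment-aware encoding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageTemporalAttention(nn.Module):
    def __init__(self, word_dim=300, vis_dim=512, att_dim=256):
        super().__init__()
        self.proj_w = nn.Linear(word_dim, att_dim)
        self.proj_v = nn.Linear(2 * vis_dim, att_dim)  # moment + context
        self.score = nn.Linear(att_dim, 1)

    def forward(self, words, moment, context):
        # words: (batch, n_words, word_dim); moment/context: (batch, vis_dim)
        vis = self.proj_v(torch.cat([moment, context], dim=-1)).unsqueeze(1)
        e = torch.tanh(self.proj_w(words) + vis)
        alpha = F.softmax(self.score(e).squeeze(-1), dim=-1)   # per-word weights
        return (alpha.unsqueeze(-1) * words).sum(dim=1)        # moment-aware query

lta = LanguageTemporalAttention()
q = lta(torch.randn(2, 12, 300),   # query word embeddings
        torch.randn(2, 512),       # candidate moment feature
        torch.randn(2, 512))       # temporal context feature
```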
Keywords: Temporal moment localization, Micro-video analysis, Language-temporal attention mechanism, Multimodal dictionary learning, Tensor fusion