Research On Semantic Analysis And Understanding Of Multimodal Video

Posted on: 2024-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: Q C Tan
Full Text: PDF
GTID: 2568307079470894
Subject: Electronic information
Abstract/Summary:
With the continuous upgrading of storage technology and the ongoing development of deep learning, multimodal data has become easier and more efficient to store and process, which further supports research on multimodal video learning. Audio-visual event localization and audio-visual video parsing are currently among the main research tasks in this field, and they have broad applications in human-computer interaction, video surveillance, and other areas. However, three main problems remain in current research methods: (1) when attending to the different modalities, existing methods process only the visual information unilaterally, without considering cross-attention between the audio and visual modalities or any dedicated processing of the audio; (2) the relationship between the audio and visual modalities is not fully explored, and the processing relies only on simple attention mechanisms, which cannot effectively exploit the key information in different segments; (3) under weak supervision, aggregating segment-level predictions with average pooling alone is not effective, since the results of different segments and modalities cannot be aggregated selectively into more robust video-level predictions. In response to these challenges, this thesis carries out the following work on the audio-visual event localization and audio-visual video parsing tasks.

(1) For cross-attention between the audio and visual modalities in audio-visual event localization, this thesis uses pre-fused audio-visual features as guidance to focus on the visual regions related to the corresponding audio while suppressing interference from background regions, and attends to the audio modality information in turn; the corresponding audio features are further enhanced through a residual connection. A minimal sketch of this mechanism follows.
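The sketch below illustrates this kind of fusion-guided spatial attention with a residual refinement of the audio feature. It is a minimal PyTorch sketch under assumed shapes and layer names (FusionGuidedAttention, the feature dimension, and the region count are all hypothetical), not the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class FusionGuidedAttention(nn.Module):
    """Hypothetical sketch: a pre-fused audio-visual feature queries the
    spatial visual regions, and the audio feature is refined through a
    residual connection."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.query = nn.Linear(dim, dim)       # projects the fused feature
        self.key = nn.Linear(dim, dim)         # projects each visual region
        self.audio_proj = nn.Linear(dim, dim)  # residual audio refinement

    def forward(self, fused, visual, audio):
        # fused:  (B, D)    pre-fused audio-visual feature per segment
        # visual: (B, R, D) R spatial regions per segment
        # audio:  (B, D)    audio feature per segment
        q = self.query(fused).unsqueeze(1)                      # (B, 1, D)
        k = self.key(visual)                                    # (B, R, D)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5            # (B, R)
        attn = torch.softmax(scores, dim=-1)
        # weighted sum keeps audio-related regions, damps background
        attended_visual = (attn.unsqueeze(-1) * visual).sum(1)  # (B, D)
        refined_audio = audio + self.audio_proj(audio)          # residual
        return attended_visual, refined_audio
```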
(2) For exploring the relationships within and between segments of the audio and visual modalities in audio-visual video parsing, this thesis, on the one hand, models intra-modal segment relationships with a self-attention mechanism; on the other hand, it models inter-modal segment relationships with an algorithm similar to the way heterogeneous graphs process different types of nodes: the relation between two segments is kept or discarded by comparing their similarity against a threshold, and the information that other segments contribute to the current segment is then aggregated by attention weighting (a sketch of this thresholded relation appears below).

(3) For prediction under weak supervision: in audio-visual event localization, this thesis follows the multiple-instance learning approach and obtains video-level predictions by pooling the predicted event-relevance scores together with the event-category scores; in audio-visual video parsing, based on multimodal multiple-instance learning, the network is constrained by an attentive-pooling loss and a contrastive learning loss, and a three-branch learning scheme is used to obtain better predictions for each modality (a pooling sketch also appears below).

(4) The algorithms proposed in each research scenario are fully verified by experiments, and their effectiveness is further demonstrated through comparative analysis against recent methods in the related fields and through ablation studies.
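The following sketch shows the thresholded cross-modal relation described in contribution (2). The threshold value tau, the cosine-similarity measure, and the residual update are assumptions for illustration; the thesis's exact similarity function and update rule may differ.

```python
import torch
import torch.nn.functional as F

def cross_modal_aggregate(audio, visual, tau=0.5):
    """Hypothetical sketch: segments of one modality attend to segments of
    the other only when their similarity exceeds a threshold, analogous to
    edges between different node types in a heterogeneous graph."""
    # audio, visual: (T, D) segment-level features over T segments
    sim = F.cosine_similarity(audio.unsqueeze(1), visual.unsqueeze(0), dim=-1)  # (T, T)
    keep = sim > tau                      # prune weak cross-modal relations
    sim = sim.masked_fill(~keep, float('-inf'))
    weights = torch.softmax(sim, dim=-1)  # attention over the kept edges
    weights = torch.nan_to_num(weights)   # segments with no edge get zero weight
    # aggregate cross-modal context into each audio segment (residual update)
    return audio + weights @ visual
```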
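This last sketch illustrates attention-based multimodal multiple-instance pooling as in contribution (3): segment-level class logits from the two modality branches are weighted along both the temporal and the modality axes before being aggregated into a video-level prediction. Layer names and shapes are assumptions, and the attentive-pooling and contrastive losses mentioned above are not shown.

```python
import torch
import torch.nn as nn

class AttentiveMMILPooling(nn.Module):
    """Hypothetical sketch of attention-based multimodal
    multiple-instance pooling for weakly supervised prediction."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.cls = nn.Linear(dim, num_classes)            # segment-level logits
        self.temporal_attn = nn.Linear(dim, num_classes)  # weights over segments
        self.modal_attn = nn.Linear(dim, num_classes)     # weights over modalities

    def forward(self, feats):
        # feats: (B, M, T, D) with M = 2 modalities and T segments
        logits = self.cls(feats)                               # (B, M, T, C)
        w_t = torch.softmax(self.temporal_attn(feats), dim=2)  # over segments
        w_m = torch.softmax(self.modal_attn(feats), dim=1)     # over modalities
        video_logits = (logits * w_t * w_m).sum(dim=(1, 2))    # (B, C)
        return torch.sigmoid(video_logits)                     # video-level scores
```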
Keywords/Search Tags: Multimodal, Cross-attention, Audio-visual learning, Audio-visual event localization, Audio-visual video parsing