
Video Semantic Analysis Based On Multimodal Features

Posted on: 2021-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: W Yuan
Full Text: PDF
GTID: 2518306308975699
Subject: Electronic Science and Technology
Abstract/Summary:
In the information age, video data is growing explosively, and extracting valuable key information from massive video collections has become increasingly important. With the rapid development of deep learning, video understanding has emerged as a research hotspot. This thesis studies video semantics from the perspectives of video highlight detection and temporal action localization, which has practical value in reducing video browsing time and the cost of producing video summaries.

Since video data contains multiple modalities of information (images, text, and audio), this thesis proposes a Multimodal Analysis Approach (MAA) for real-time automatic editing of video highlights. Most existing work on highlight extraction applies a single method to detect one or a few types of highlights; in contrast, this thesis combines several computer-vision algorithms to model multiple kinds of video semantic information and thereby detect a more diverse set of highlights. Four rich sports-video-centric datasets were constructed for multimodal analysis, and an experimental system for automatically editing highlights was tested on live streams. The results are verified in two ways: against a set of predefined exciting events, and through a scoring mechanism that compares the extracted highlights with videos produced by professional editors to evaluate their quality, thereby demonstrating the effectiveness of MAA.

For untrimmed, lengthy videos, this thesis proposes a new temporal action localization algorithm and designs a Multi-stream Temporal Network (MTN) and a Proposal Scoring Network (PSN). Multi-dimensional encodings of the video content, such as image, audio, and motion, serve as network inputs to enrich the feature representation. Early fusion performs temporal computation over the features of the different modalities, so as to make maximal use of the modalities that are most informative for localizing the true target actions. Intermediate-layer fusion branches across modalities are added, and a temporal convolution layer extracts proposal-level features, so that the multimodal information is deeply fused and strongly characterizes the key content of the video. Finally, the effectiveness and advantages of the proposed algorithm are demonstrated on two public datasets and one manually constructed dataset.
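To make the MAA fusion step concrete, the following is a minimal sketch of how per-modality highlight scores might be combined into clip boundaries. The thesis does not publish code, so the function name, modality weights, threshold, and minimum-length parameter below are hypothetical illustrations, not the actual system:

import numpy as np

def fuse_highlight_scores(visual, audio, text, weights=(0.5, 0.3, 0.2),
                          threshold=0.6, min_len=5):
    """Hypothetical late fusion of per-frame scores (each in [0, 1])
    from three modality-specific detectors into highlight segments.

    Returns (start, end) frame-index pairs whose fused score stays
    above `threshold` for at least `min_len` consecutive frames.
    """
    fused = (weights[0] * np.asarray(visual)
             + weights[1] * np.asarray(audio)
             + weights[2] * np.asarray(text))
    above = fused >= threshold
    segments, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                      # segment opens
        elif not flag and start is not None:
            if i - start >= min_len:       # keep only long-enough runs
                segments.append((start, i))
            start = None
    if start is not None and len(above) - start >= min_len:
        segments.append((start, len(above)))
    return segments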
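Similarly, the early-fusion-plus-temporal-convolution idea behind MTN can be sketched as follows, assuming PyTorch. The stream dimensions, layer sizes, and concatenation-based fusion are assumptions chosen for illustration, not the thesis's published architecture:

import torch
import torch.nn as nn

class MultiStreamTemporalNet(nn.Module):
    # Illustrative MTN-style model: one projection per modality stream
    # (e.g. image, audio, motion features), early fusion by channel
    # concatenation, temporal convolutions, and a per-timestep score head.
    def __init__(self, dims=(2048, 128, 1024), hidden=512):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv1d(d, hidden, 1) for d in dims)
        self.temporal = nn.Sequential(
            nn.Conv1d(hidden * len(dims), hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1),
            nn.ReLU(),
        )
        self.score = nn.Conv1d(hidden, 1, 1)

    def forward(self, streams):
        # streams: list of tensors, each shaped (batch, dim_i, time).
        fused = torch.cat([p(s) for p, s in zip(self.proj, streams)], dim=1)
        return torch.sigmoid(self.score(self.temporal(fused)))  # (B, 1, T)

For example, calling the module on three feature streams of shape (batch, dim, time) yields a per-timestep actionness score over the shared time axis, which a PSN-style proposal scoring stage could then consume to rank candidate segments.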
Keywords/Search Tags:Deep Learning, Multimodal Feature, Highlight Extraction, Temporal Convolution