
Video Semantic Analysis Based On Multimodal Features

Posted on: 2021-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: W Yuan
Full Text: PDF
GTID: 2518306308975699
Subject: Electronic Science and Technology
Abstract/Summary:
In the information age, video data is growing explosively, and extracting valuable key information from massive video collections has become increasingly important. With the rapid development of deep learning, video understanding has emerged as a research hotspot. This thesis studies video semantics from the perspectives of video highlight detection and temporal action localization, which has practical value in reducing video browsing time and the cost of producing video summaries.

Since video data contains multiple modalities of information (images, text, and audio), this thesis proposes a Multimodal Analysis Approach (MAA) for real-time automatic editing of video highlights. Most existing work on highlight extraction applies a single method to detect one or a few types of highlights; in contrast, this thesis combines several computer-vision algorithms to model multiple kinds of video semantic information and thereby detect a more diverse set of highlights. Four rich sports-video-centric datasets were constructed for multimodal analysis, and an experimental system for automatically editing highlights was tested on live streams. The results are verified in two ways: against a set of predefined exciting events, and through a scoring mechanism that compares the extracted highlights with videos produced by professional editors to evaluate their quality, thereby demonstrating the effectiveness of MAA.

For untrimmed, lengthy videos, this thesis proposes a new temporal action localization algorithm and designs a Multi-stream Temporal Network (MTN) and a Proposal Scoring Network (PSN). Multi-dimensional encodings of the video content, such as image, audio, and motion, serve as network inputs to enrich the feature representation. Early fusion performs temporal computation over the features of the different modalities, so as to make maximal use of the modalities that are most informative for localizing the true target actions. Intermediate-layer fusion branches across modalities are added, and a temporal convolution layer extracts proposal-level features, so that the multimodal information is deeply fused and strongly characterizes the key content of the video. Finally, the effectiveness and advantages of the proposed algorithm are demonstrated on two public datasets and one manually constructed dataset.
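To make the MAA fusion step concrete, the following is a minimal sketch of how per-modality highlight scores might be combined into clip boundaries. The thesis does not publish code, so the function name, modality weights, threshold, and minimum-length parameter below are hypothetical illustrations, not the actual system:

import numpy as np

def fuse_highlight_scores(visual, audio, text, weights=(0.5, 0.3, 0.2),
                          threshold=0.6, min_len=5):
    """Hypothetical late fusion of per-frame scores (each in [0, 1])
    from three modality-specific detectors into highlight segments.

    Returns (start, end) frame-index pairs whose fused score stays
    above `threshold` for at least `min_len` consecutive frames.
    """
    fused = (weights[0] * np.asarray(visual)
             + weights[1] * np.asarray(audio)
             + weights[2] * np.asarray(text))
    above = fused >= threshold
    segments, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                      # segment opens
        elif not flag and start is not None:
            if i - start >= min_len:       # keep only long-enough runs
                segments.append((start, i))
            start = None
    if start is not None and len(above) - start >= min_len:
        segments.append((start, len(above)))
    return segments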
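Similarly, the early-fusion-plus-temporal-convolution idea behind MTN can be sketched as follows, assuming PyTorch. The stream dimensions, layer sizes, and concatenation-based fusion are assumptions chosen for illustration, not the thesis's published architecture:

import torch
import torch.nn as nn

class MultiStreamTemporalNet(nn.Module):
    # Illustrative MTN-style model: one projection per modality stream
    # (e.g. image, audio, motion features), early fusion by channel
    # concatenation, temporal convolutions, and a per-timestep score head.
    def __init__(self, dims=(2048, 128, 1024), hidden=512):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv1d(d, hidden, 1) for d in dims)
        self.temporal = nn.Sequential(
            nn.Conv1d(hidden * len(dims), hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1),
            nn.ReLU(),
        )
        self.score = nn.Conv1d(hidden, 1, 1)

    def forward(self, streams):
        # streams: list of tensors, each shaped (batch, dim_i, time).
        fused = torch.cat([p(s) for p, s in zip(self.proj, streams)], dim=1)
        return torch.sigmoid(self.score(self.temporal(fused)))  # (B, 1, T)

For example, calling the module on three feature streams of shape (batch, dim, time) yields a per-timestep actionness score over the shared time axis, which a PSN-style proposal scoring stage could then consume to rank candidate segments.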
Keywords/Search Tags:Deep Learning, Multimodal Feature, Highlight Extraction, Temporal Convolution