Research On Video Memorability Prediction Based On Multimodal Feature Fusion

Posted on:2022-10-17

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Chang

Full Text:PDF

GTID:2558307118996189

Subject:Computer Science and Technology

Abstract/Summary:

Since the introduction of content sharing platforms, the number and variety of online videos have grown enormously. Studies have shown that how much content people remember varies with the videos they watch: some videos are remembered for a long time, while others are forgotten almost instantly. Video memorability is a metric that describes how memorable a video is, and computational models for predicting it have promising practical applications. This thesis therefore studies how to predict video memorability effectively. Memorability is an inherent attribute of images, and people share common preferences in what content they find memorable. Unlike images, videos fuse image, sound, text and motion information and are therefore richer in content, so video memorability prediction is affected by more factors than image memorability prediction. Because single-modal prediction models fail to take all relevant factors into account, they often perform poorly in real-life video memorability prediction. This thesis takes video as its research object and explores the effect of text, image depth and motion information on video memorability. Its main goal is to build an effective video memorability prediction model that improves prediction performance. The main contents are as follows:

(1) To study the effect of video titles and image depth on video memorability, a prediction model that fuses text features and depth visual features is proposed. First, text features are extracted from the titles with the TF-IDF algorithm, so that words that affect the memorability of a video are given a weight. Second, videos are preprocessed frame by frame, and a depth estimation model is used to extract depth maps as the depth information. A pre-trained ResNet-152 network extracts the visual features of the video, and a ResNet-152 network fine-tuned on the depth map dataset extracts the depth features; the depth features and visual features are concatenated to obtain the depth visual features. The text features and the depth visual features are then each used to predict video memorability scores with a regression algorithm, and the predictions are combined by late fusion with weighted averaging. Finally, comparative experiments are conducted on a large public dataset. The model achieves a Spearman's rank correlation of 0.547 for short-term and 0.260 for long-term memorability prediction, which demonstrates the effectiveness of the proposed model.

(2) Existing video memorability prediction models do not describe well the impact of motion information on video memorability. To address this and to further improve prediction performance, motion features, described in the form of optical flow, are added to the model that fuses text and depth visual features. The result is a multimodal fusion model for video memorability prediction that integrates text information, image depth and optical flow information. First, an optical flow estimation model is used to extract optical flow maps, and a ResNet-152 network fine-tuned on the optical flow map dataset extracts the optical flow features. The features of each of the three modalities are then used individually to predict video memorability scores with a regression algorithm. Finally, the three modalities are fused. A series of comparative experiments on a public dataset yields a Spearman's rank correlation of 0.567 for short-term and 0.272 for long-term memorability prediction, showing that the proposed multimodal feature fusion method improves video memorability prediction performance.

(3) The video memorability prediction model based on multimodal feature fusion is applied to a company's encoded stream pusher to predict advertisement memorability, and an advertisement memorability prediction module is designed and analyzed. Video memorability prediction experiments are carried out on mobile phone advertisements, and the analysis of the experimental results shows that the model proposed in this thesis can effectively predict the memorability of advertisements.
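
The late fusion and evaluation steps described in the abstract can be illustrated with a minimal sketch. It assumes each modality has already produced a memorability score per video from its own regressor, combines those scores by weighted averaging, and measures Spearman's rank correlation against ground-truth scores. The modality names, the weights and all score values below are hypothetical placeholders chosen for illustration; the abstract does not state the weights or the regression algorithm used in the thesis.

```python
from math import sqrt
from typing import Dict, List, Sequence


def late_fusion(predictions: Dict[str, Sequence[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-modality predictions (late fusion)."""
    total = sum(weights[name] for name in predictions)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_videos = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * scores[i] for name, scores in predictions.items()) / total
        for i in range(n_videos)
    ]


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Hypothetical per-modality predictions for four videos.
    predictions = {
        "text": [0.80, 0.70, 0.90, 0.60],
        "depth_visual": [0.85, 0.65, 0.80, 0.70],
        "optical_flow": [0.75, 0.60, 0.95, 0.65],
    }
    # Hypothetical fusion weights, not taken from the thesis.
    weights = {"text": 0.3, "depth_visual": 0.4, "optical_flow": 0.3}
    ground_truth = [0.82, 0.66, 0.91, 0.70]

    fused = late_fusion(predictions, weights)
    print("fused scores:", fused)
    print("Spearman:", spearman(fused, ground_truth))
```

Because Spearman's correlation depends only on the ordering of the scores, the fused predictions are evaluated on how well they rank videos by memorability rather than on their absolute values. In practice the fusion weights would typically be tuned on a validation split.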

Keywords/Search Tags:

Video memorability, Text features, Depth visual features, Optical flow features, Multimodal feature fusion

Related items

1	Extracting High-level Multimodal Features
2	Research Of Video Summarization Based On Multimodal Features Fusion
3	Visual Semantic Understanding And Question Answering Research Based On Knowledge Graph
4	Technology Research, Content-based Visual Information Retrieval
5	Research And Application Of Facial Iris Dual Feature Fusion Recognition Algorithm Based On Deep Learning
6	Key Point Extraction Of Architectural Images Based On Fusion Of Traditional Features And Convolutional Features
7	Research On Methods Of Behavior Recognition Using Feature Fusion
8	Research On Abnormal Event Detection In Intelligent Surveillance Video Based On Hybrid Features
9	A visual analysis of articulated motion complexity based on optical flow and spatial-temporal features
10	Personal Identification System Based On Fusion Features Of Fingers