
Research on Video Classification Based on Cross-Modal Features

Posted on: 2021-03-05
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Ma
Full Text: PDF
GTID: 2428330620963992
Subject: Engineering
Abstract/Summary:
With the advent of the multimedia era and the vigorous development of the mobile Internet, more and more people record and share their lives by shooting and uploading videos. Video content recognition is an important research area in computer vision and has received extensive attention from both industry and academia; it is the basis of video identification and classification. Traditional classification methods rely heavily on features hand-designed by experts for each part of the video content. Deep neural networks have made outstanding contributions to audio and image recognition, and have since been extended to video recognition, which can be viewed as image recognition over a continuous input of multiple frames.

In this thesis, the widely used UCF101, HMDB51, and Kinetics400 datasets are selected as experimental datasets. The experiments focus on how to extract the features of different modalities from a single video and how to make full use of them; once the video features have been obtained, greatly improving classification accuracy with such a limited set of features is very difficult. The main contribution of this thesis is a multi-channel video classification model based on an attention mechanism, implemented with a Python framework, which supports multi-modal feature input and fusion of different modal features. In the feature extraction stage, four kinds of features are extracted: RGB image features, optical flow features, 3D video features, and audio features. Each feature type is fed into its own classification network to obtain a probability distribution over video content labels. In the classification stage, different modal features are combined for classification, and an attention mechanism is added at the feature extraction level. In the final fusion stage, feature splicing fusion, mean fusion, and result fusion are compared in multiple contrast experiments. Finally, a multi-channel video classification model based on the attention mechanism is established.

On top of the classification task, a cross-modal retrieval task is added. By mapping pre-standardized features into the same semantic space, the distance between video modal features and image modal features is continuously reduced, and both feature types are hashed into 64-bit codes for mutual retrieval.

The experiments are based on neural networks for both classification and retrieval. Unlike traditional classification approaches, which only classify the text labels of videos, do not exploit the characteristics of the videos themselves, and rely excessively on manual labels, the neural network can mine similarities among the essential features of videos and obtain a probability distribution over each video's labels by fitting massive amounts of video features. In the video classification experiments, the expected results are achieved on all three datasets. In the cross-modal retrieval experiment, the retrieval accuracy is 0.84, surpassing the vast majority of previous algorithms.
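The three fusion strategies named above (feature splicing, mean fusion, and result fusion) can be illustrated with a minimal numpy sketch. The feature dimensions, the two-modality setup, and the random vectors are hypothetical placeholders, not values from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 101                       # e.g. the UCF101 label set
rgb_feat = rng.normal(size=512)       # hypothetical per-modality feature vectors
flow_feat = rng.normal(size=512)

# 1) Feature splicing fusion: concatenate modality features before the classifier
spliced = np.concatenate([rgb_feat, flow_feat])   # shape (1024,)

# 2) Mean fusion: average features of equal dimension
mean_fused = (rgb_feat + flow_feat) / 2           # shape (512,)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# 3) Result (late) fusion: average each stream's class-probability output
rgb_probs = softmax(rng.normal(size=n_classes))
flow_probs = softmax(rng.normal(size=n_classes))
result_fused = (rgb_probs + flow_probs) / 2       # still a valid distribution
```

Splicing preserves all per-modality information but grows the classifier input; mean and result fusion keep dimensions fixed but assume the streams are comparable in scale.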
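The attention mechanism applied at the feature level can be sketched as a softmax weighting over modality features. The scalar attention scores and feature shapes here are illustrative assumptions; in the thesis the weights would be learned:

```python
import numpy as np

def attention_fuse(features, scores):
    """Fuse same-dimension modality features with softmax attention weights."""
    w = np.exp(scores - np.max(scores))   # numerically stable softmax
    w = w / w.sum()
    stacked = np.stack(features)          # (n_modalities, dim)
    return (w[:, None] * stacked).sum(axis=0), w

rng = np.random.default_rng(1)
rgb, flow, audio = (rng.normal(size=512) for _ in range(3))
scores = np.array([1.2, 0.4, -0.5])       # hypothetical learned modality scores
fused, weights = attention_fuse([rgb, flow, audio], scores)
```

A higher score lets the more informative modality dominate the fused representation while the weights still sum to one.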
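Hashing both modalities into 64-bit codes for mutual retrieval, as described for the cross-modal experiment, can be sketched with sign-based binarization and Hamming distance. The shared random projection stands in for the learned mapping into the common semantic space and is purely an assumption:

```python
import numpy as np

def hash64(feat, proj):
    """Binarize a feature into a 64-bit code via the sign of a projection."""
    return (feat @ proj > 0).astype(np.uint8)   # shape (64,)

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(42)
proj = rng.normal(size=(512, 64))            # hypothetical shared projection
video_feat = rng.normal(size=512)
image_feat = video_feat + 0.1 * rng.normal(size=512)  # a nearby cross-modal pair

v_code = hash64(video_feat, proj)
i_code = hash64(image_feat, proj)
dist = hamming(v_code, i_code)               # small for semantically close pairs
```

Retrieval then amounts to ranking candidates by Hamming distance, which is cheap to compute over 64-bit codes.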
Keywords/Search Tags:modal features, attention mechanism, video classification, cross-modality, video retrieval