
Research On Video Object Detection Based On Multimodality And Knowledge Distillation

Posted on: 2024-05-20    Degree: Master    Type: Thesis
Country: China    Candidate: H J Wang    Full Text: PDF
GTID: 2568307106481544    Subject: Electronic information
Abstract/Summary:
Object detection plays a key role in a range of real-world applications, including video surveillance, autonomous driving, and unmanned aerial vehicles. In these fields, videos are far more prevalent than still images, so efficient processing of video data has long been desirable. Video object detection, however, still faces several difficulties. First, most detectors rely on a single modality and therefore capture insufficient information about the scene. Second, static-image object detectors are hard to apply directly to video; the complex designs usually added to adapt them gain accuracy at the cost of model size, which makes deployment on embedded devices difficult. Third, occlusion, blur, and defocus of objects in the video stream degrade detection performance. To address these challenges in the context of autonomous driving, this thesis moves progressively from visual-only modal fusion to audio-visual fusion, obtaining a network with few parameters and low computational cost while maintaining detection effectiveness and robustness. The main contributions of this thesis are as follows:

(1) Existing RGB-D object detectors typically fuse the RGB and Depth modalities along a single path, which is insufficient to integrate all the information from the two modalities, and they make limited use of context to learn spatio-temporal features. We propose a novel Cross-modal Multipath Fusion Network (CMFNet) for better multimodal video object detection. In contrast with previous single-modal video object detection methods and single-path multi-modal fusion methods, the core idea of CMFNet is to learn spatio-temporal features of the video context together with multi-modal features, improving the efficiency of multi-modal fusion. Multiple dense blocks are integrated into the network to carry modal information from low-order to high-order features; they allow RGB and Depth information to be fused throughout feature extraction and yield mixed features at different scales, making the fusion process more flexible and comprehensive. In addition, a Cross-frame Feature Alignment (CFA) algorithm is proposed to propagate high-level features across frames and learn the spatial correspondence between different frames, which makes inter-frame feature propagation and alignment more accurate and improves detection of occluded, blurred, or appearance-changing objects.

(2) In scenes with insufficient light, the perception ability of visual sensors is impaired and detector performance drops. Starting from the cues available in such scenes, we introduce a new modality, Audio, on top of the fused RGB and Depth modalities and perform video object detection with combined audio-visual cues. We propose an Audio-visual Distillation Network (AVDNet) that fuses the RGB, Depth, and Audio modalities and is trained using the correspondence between video and audio. Based on knowledge distillation, a simple Mask Learning method uses CMFNet as the teacher network and transfers knowledge of object locations from the visual modalities to the audio modality, as sketched below.
To further exploit the complementary information among the RGB, Depth, and Audio modalities, the entire knowledge distillation framework is trained in a self-supervised manner. On top of CMFNet, a feature alignment loss is designed to align the complementary cues of the intermediate layers of the teacher network and the audio student network.
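A minimal sketch of such an intermediate-layer feature alignment loss follows, assuming the teacher and the audio student each expose one mid-level feature map; the 1x1 projection, the bilinear resizing, and the normalized MSE objective are assumptions chosen for illustration rather than the thesis's exact formulation.

```python
# Hypothetical sketch: aligning a student's intermediate features with a frozen teacher's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAlignmentLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # Project student features into the teacher's channel dimension.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # student_feat: (B, Cs, Hs, Ws); teacher_feat: (B, Ct, Ht, Wt)
        s = self.proj(student_feat)
        s = F.interpolate(s, size=teacher_feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        # Compare channel-normalized features; the teacher is detached (frozen).
        s = F.normalize(s, dim=1)
        t = F.normalize(teacher_feat.detach(), dim=1)
        return F.mse_loss(s, t)
```

In training, this term would be added to the mask-distillation loss so that the audio student matches both the teacher's predicted object locations and its intermediate representations.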
Keywords/Search Tags:Video object detection, Multi-modal fusion, Knowledge distillation, Autonomous driving