
Research On Video Object Detection Based On Multimodality And Knowledge Distillation

Posted on: 2024-05-20    Degree: Master    Type: Thesis
Country: China    Candidate: H J Wang    Full Text: PDF
GTID: 2568307106481544    Subject: Electronic information
Abstract/Summary:
Object detection plays a key role in a range of real-world applications, including video surveillance, autonomous driving, and unmanned aerial vehicles. In these fields, videos are far more prevalent than still images, so efficient processing of video data has long been desirable. Video object detection, however, still faces several difficulties. First, most detectors rely on a single modality and therefore capture insufficient information about the scene. Second, static-image object detectors are hard to apply directly to video; the complex designs usually added to adapt them gain accuracy at the cost of model size, which makes deployment on embedded devices difficult. Third, occlusion, blur, and defocus of objects in the video stream degrade detection performance. To address these challenges in the context of autonomous driving, this thesis moves progressively from visual-only modal fusion to audio-visual fusion, obtaining a network with few parameters and low computational cost while maintaining detection effectiveness and robustness. The main contributions of this thesis are as follows:

(1) Existing RGB-D object detectors typically fuse the RGB and Depth modalities along a single path, which is insufficient to integrate all the information from the two modalities, and they make limited use of context to learn spatio-temporal features. We propose a novel Cross-modal Multipath Fusion Network (CMFNet) for better multimodal video object detection. In contrast with previous single-modal video object detection methods and single-path multi-modal fusion methods, the core idea of CMFNet is to learn spatio-temporal features of the video context together with multi-modal features, improving the efficiency of multi-modal fusion. Multiple dense blocks are integrated into the network to carry modal information from low-order to high-order features; they allow RGB and Depth information to be fused throughout feature extraction and yield mixed features at different scales, making the fusion process more flexible and comprehensive. In addition, a Cross-frame Feature Alignment (CFA) algorithm is proposed to propagate high-level features across frames and learn the spatial correspondence between different frames, which makes inter-frame feature propagation and alignment more accurate and improves detection of occluded, blurred, or appearance-changing objects.

(2) In scenes with insufficient light, the perception ability of visual sensors is impaired and detector performance drops. Starting from the cues available in such scenes, we introduce a new modality, Audio, on top of the fused RGB and Depth modalities and perform video object detection with combined audio-visual cues. We propose an Audio-visual Distillation Network (AVDNet) that fuses the RGB, Depth, and Audio modalities and is trained using the correspondence between video and audio. Based on knowledge distillation, a simple Mask Learning method uses CMFNet as the teacher network and transfers knowledge of object locations from the visual modalities to the audio modality, as sketched below.
To further exploit the complementary information among the RGB, Depth, and Audio modalities, the entire knowledge distillation framework is trained in a self-supervised manner. On top of CMFNet, a feature alignment loss is designed to align the complementary cues of the intermediate layers of the teacher network and the audio student network.
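A minimal sketch of such an intermediate-layer feature alignment loss follows, assuming the teacher and the audio student each expose one mid-level feature map; the 1x1 projection, the bilinear resizing, and the normalized MSE objective are assumptions chosen for illustration rather than the thesis's exact formulation.

```python
# Hypothetical sketch: aligning a student's intermediate features with a frozen teacher's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAlignmentLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # Project student features into the teacher's channel dimension.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # student_feat: (B, Cs, Hs, Ws); teacher_feat: (B, Ct, Ht, Wt)
        s = self.proj(student_feat)
        s = F.interpolate(s, size=teacher_feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        # Compare channel-normalized features; the teacher is detached (frozen).
        s = F.normalize(s, dim=1)
        t = F.normalize(teacher_feat.detach(), dim=1)
        return F.mse_loss(s, t)
```

In training, this term would be added to the mask-distillation loss so that the audio student matches both the teacher's predicted object locations and its intermediate representations.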
Keywords/Search Tags:Video object detection, Multi-modal fusion, Knowledge distillation, Autonomous driving