Violence Detection Based On 3D Attention And Cross-modal Self-distillation Network

Posted on: 2024-09-02
Degree: Master
Type: Thesis
Country: China
Candidate: J X Wang
GTID: 2568307058971859
Subject: Electronic information
Abstract/Summary:
Violence detection is an important component of video understanding, one of the major application areas of computer vision. It can help users avoid the negative effects of inappropriate media on the internet, and it can be used in video surveillance systems to help law enforcement personnel detect illegal activities. It has therefore received widespread attention from computer vision researchers and has been widely applied in intelligent video review, intelligent video surveillance, intelligent city management, and other fields. To improve how well violence detection methods extract key temporal information, and to reduce the modal noise and modal asynchrony caused by audio-visual feature fusion, this thesis modifies and optimizes the deep detection network by introducing attention modules and self-distillation. The contributions are as follows.

First, to alleviate insufficient attention to spatio-temporal features, this thesis proposes a violence content detection algorithm based on 3D attention enhancement. The algorithm builds on an improved 3D DenseNet model that uses P3D convolution to extract low-level spatio-temporal information, which substantially reduces the large number of parameters that full 3D convolution would require. To further improve performance, the model adds spatial-channel attention and temporal transition layers with embedded temporal attention to extract key spatio-temporal information. Together these form a three-dimensional spatial-channel-temporal attention that highlights the multidimensional discriminative information in the features. Experimental results show that the algorithm achieves accuracies of 98.75%, 100%, and 89.25% on the Hockey, Movies, and RWF-2000 datasets, respectively. In long-video violence localization experiments, it achieves better detection performance on the VSD2014 dataset, demonstrating its generalization ability in violence content detection.

Second, to make full use of the multimodal information in videos and to address modal context modeling and modal asynchrony, this thesis proposes a cross-modal self-distillation violence detection algorithm based on audio-visual fusion. The algorithm injects audio features into the visual features through a cross-modal attention mechanism, which extracts the important parts of the fused features and thereby enhances the audio-visual fused representation. A self-distillation module then allows the model to transfer the knowledge learned by the visual stream to the audio-visual stream, which effectively reduces the modal noise and modal asynchrony introduced when visual and audio features interact in the cross-modal attention module. The algorithm achieves an average precision of 84.7% on the XD-Violence dataset.
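The parameter saving attributed to P3D convolution in the first contribution can be checked with simple arithmetic. The sketch below assumes the usual pseudo-3D factorization, in which a 3x3x3 kernel is replaced by a 1x3x3 spatial convolution followed by a 3x1x1 temporal convolution. The channel width of 64 and the helper names are illustrative assumptions and are not taken from the thesis.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a dense 3D convolution with a kt x kh x kw kernel (no bias)."""
    return c_in * c_out * kt * kh * kw


def p3d_params(c_in, c_out, k=3):
    """Weight count of a factorized block: 1 x k x k spatial conv, then k x 1 x 1 temporal conv."""
    spatial = conv3d_params(c_in, c_out, 1, k, k)
    temporal = conv3d_params(c_out, c_out, k, 1, 1)
    return spatial + temporal


# Hypothetical layer width, chosen only for illustration.
full = conv3d_params(64, 64, 3, 3, 3)
factored = p3d_params(64, 64)
ratio = factored / full
```

When the input and output widths are equal, the factorized block keeps 12 of every 27 weights, or about 44% of the full 3D kernel. The actual saving in the thesis depends on its layer configuration, which the abstract does not give.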
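The abstract does not state which loss the self-distillation module uses. One common way to transfer knowledge between two streams of the same model is a KL-divergence term between temperature-softened predictions, with the visual stream acting as teacher and the audio-visual stream as student. The following is a minimal sketch under that assumption. The function names, the temperature value, and the example logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class probabilities."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical [violent, non-violent] scores for one video clip.
visual_logits = [2.0, -1.0]        # teacher: visual stream
audio_visual_logits = [0.5, 0.3]   # student: audio-visual stream
loss = distillation_loss(visual_logits, audio_visual_logits)
same = distillation_loss(visual_logits, visual_logits)
```

The loss is zero when the two streams agree and grows as the audio-visual predictions drift away from the visual ones. In training, such a term would typically be added to the ordinary classification loss.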
Keywords/Search Tags: Violence detection, Attention mechanism, Self-distillation network, DenseNet, Cross-modal attention mechanism, RWF-2000, XD-Violence