
Research on Violence Detection in Audio and Video Based on the Attention Mechanism

Posted on: 2022-02-14    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Min    Full Text: PDF
GTID: 2518306572459844    Subject: Computer technology
Abstract/Summary:
Violence detection occupies an important position in the field of audio and video analysis and is of great research significance. Detecting violent behavior quickly in security applications helps reduce personal injuries, and detecting it in sports competitions contributes to the fairness of the game. With the growth of the Internet and streaming media, manual review can no longer meet the required speed, so better automatic violence detection methods are needed. At present, most violence detection works only on video and handles a single detection type, which calls for a violence detection technique that combines features from multiple modalities.

First, this thesis uses a new network model to detect violence in audio and in video separately. The video and audio are aligned and divided into frames. For the video modality, a sequence of frames is fed to a Convolutional Neural Network (CNN) to extract the corresponding features. These features are then passed to an optimized Convolutional Long Short-Term Memory (ConvLSTM) network; softmax over the sequence of hidden states yields attention weights, the hidden states are combined according to these weights, and a fully connected layer finally produces the classification probability. For the audio modality, spectrograms are first generated for audio segments of the corresponding frame length, and the spectrogram sequence is fed into the same network model as in video detection to obtain the classification probability.

Second, this thesis performs violence detection based on a dual-modal fusion of vision and hearing. Two CNN+ConvLSTM branches produce the hidden-layer outputs of the visual and auditory modalities. The two outputs are summed with learned weights and normalized to the range (-1, 1) by the tanh function; softmax then assigns a weight to each hidden state, the weights are applied to the corresponding hidden-layer outputs, and a fully connected layer produces the classification probability.

Finally, this thesis applies an attention mechanism and a bidirectional network for further optimization. A multi-head self-attention mechanism is applied to the extracted features, the projections obtained in the different subspaces are concatenated, and the resulting fused features serve as the input of the ConvLSTM. The ConvLSTM itself is optimized by combining the forward and reverse directions and concatenating the two tensors as the hidden-state output; the proposed network model is then used to obtain the corresponding results.

This thesis shows that the proposed CNN-ConvLSTM attention-weighting architecture, together with feature fusion across the audio and video sub-networks, improves the relevant detection metrics on the corresponding datasets compared with previous studies, and that the model can be further improved with the bidirectional network and the self-attention mechanism. It reaches 98% accuracy on the Hockey Fight dataset, higher than the previous best of 97%. The significance of this work is that it proposes a better network architecture that improves the accuracy of violence detection and can reduce false detections to a certain degree in practical applications.
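The single-modality model described in the abstract (CNN features per frame, a ConvLSTM over time, softmax attention over the hidden states, then a fully connected classifier) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the thesis's exact configuration: the small frame-level CNN, the ConvLSTM cell implementation, and all layer sizes are assumptions.

# Sketch: CNN -> ConvLSTM -> softmax attention over hidden states -> FC classifier.
# Layer sizes and the frame-level CNN are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLSTMCell(nn.Module):
    """A single ConvLSTM cell: the four gates are computed with 2-D convolutions."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

class AttnConvLSTMClassifier(nn.Module):
    """Per-frame CNN features -> ConvLSTM -> attention-weighted hidden-state
    summary -> fully connected layer (violent / non-violent)."""
    def __init__(self, hid_ch=32, num_classes=2):
        super().__init__()
        self.hid_ch = hid_ch
        # Small frame-level CNN standing in for the feature extractor.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, hid_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.convlstm = ConvLSTMCell(hid_ch, hid_ch)
        self.score = nn.Linear(hid_ch, 1)          # scores each hidden state
        self.fc = nn.Linear(hid_ch, num_classes)   # final classifier

    def forward(self, frames):                     # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = [self.cnn(frames[:, t]) for t in range(T)]
        h = torch.zeros(B, self.hid_ch, *feats[0].shape[2:])
        c = torch.zeros_like(h)
        pooled = []
        for x in feats:                            # run the ConvLSTM over time
            h, c = self.convlstm(x, (h, c))
            pooled.append(h.mean(dim=(2, 3)))      # (B, hid_ch) per time step
        Hs = torch.stack(pooled, dim=1)            # (B, T, hid_ch)
        w = F.softmax(self.score(Hs).squeeze(-1), dim=1)   # attention weights
        ctx = (w.unsqueeze(-1) * Hs).sum(dim=1)    # weighted hidden-state summary
        return self.fc(ctx)                        # classification logits

# Usage: logits = AttnConvLSTMClassifier()(torch.randn(2, 8, 3, 64, 64))

For the audio branch, the abstract states that the spectrogram sequence is fed into the same kind of network, so the same class could be reused with spectrogram images as input.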
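The audio-visual fusion step (weighted sum of the two branches' hidden states, tanh normalization to (-1, 1), softmax attention, then a fully connected classifier) might look like the sketch below. It assumes the per-step hidden states of each branch have already been pooled to vectors; the learnable scalar branch weights and all tensor sizes are illustrative assumptions.

# Sketch: fuse visual and auditory hidden states by weighted sum + tanh,
# then softmax attention over time steps and a fully connected classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualFusion(nn.Module):
    def __init__(self, hid_ch=32, num_classes=2):
        super().__init__()
        self.w_v = nn.Parameter(torch.tensor(0.5))   # visual branch weight (assumed learnable)
        self.w_a = nn.Parameter(torch.tensor(0.5))   # auditory branch weight (assumed learnable)
        self.score = nn.Linear(hid_ch, 1)
        self.fc = nn.Linear(hid_ch, num_classes)

    def forward(self, h_vis, h_aud):                 # each: (B, T, hid_ch)
        fused = torch.tanh(self.w_v * h_vis + self.w_a * h_aud)  # weighted sum, squashed to (-1, 1)
        w = F.softmax(self.score(fused).squeeze(-1), dim=1)      # attention weight per time step
        ctx = (w.unsqueeze(-1) * fused).sum(dim=1)               # attended summary
        return self.fc(ctx)                                      # classification logits

# Usage: logits = AudioVisualFusion()(torch.randn(2, 8, 32), torch.randn(2, 8, 32))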
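The final optimizations (multi-head self-attention over the extracted features, followed by a bidirectional recurrent pass whose forward and reverse hidden states are concatenated) are sketched below. For brevity a vector LSTM stands in for the bidirectional ConvLSTM used in the thesis, and the feature dimension, head count, and hidden size are assumptions; a reasonably recent PyTorch (with batch_first support in nn.MultiheadAttention) is assumed.

# Sketch: multi-head self-attention over per-frame features, then a
# bidirectional recurrent layer (forward/backward outputs concatenated),
# attention pooling, and a fully connected classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttnBiRecurrentHead(nn.Module):
    def __init__(self, feat_dim=64, heads=4, hid=32, num_classes=2):
        super().__init__()
        # Multi-head self-attention: projections from the different subspaces
        # are concatenated internally by nn.MultiheadAttention.
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        # Vector LSTM as a stand-in for the bidirectional ConvLSTM.
        self.rnn = nn.LSTM(feat_dim, hid, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hid, 1)
        self.fc = nn.Linear(2 * hid, num_classes)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        fused, _ = self.attn(feats, feats, feats)  # self-attended feature sequence
        out, _ = self.rnn(fused)                   # forward/backward states concatenated
        w = F.softmax(self.score(out).squeeze(-1), dim=1)
        ctx = (w.unsqueeze(-1) * out).sum(dim=1)
        return self.fc(ctx)

# Usage: logits = SelfAttnBiRecurrentHead()(torch.randn(2, 8, 64))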
Keywords/Search Tags: attention mechanism, audio-video feature fusion, convolutional neural network, bidirectional long short-term memory network, violence detection