With the development of digital sound analysis technology, sound event localization and detection (SELD) is finding increasingly wide use in smart homes, security monitoring, wildlife detection, abnormal sound event detection, and related fields. SELD refers to the process of identifying single or multiple overlapping sound events, determining the time span over which each event is active, and simultaneously estimating its direction relative to the microphone. SELD can be divided into two subtasks: sound event detection and sound source localization. Sound event detection is a multi-label classification problem that aims to detect the onset and offset of sound events in time and to associate text labels with the detected events; sound source localization estimates the direction of the sound source relative to the microphone, i.e., direction-of-arrival estimation for the sound event. This paper studies deep learning based methods for localizing and detecting multiple sound events, discusses the difficulties these methods face, and addresses them with the following two contributions.

(1) Deep learning network models for sound event localization and detection have difficulty capturing the spatial and channel information of the input feature maps accurately, which limits localization and detection performance. To improve accuracy, a dual attention-based sound event localization and detection network model (CECANet) is proposed. First, a coordinate attention module is introduced into the residual blocks so that the network attends more to the spatial coordinate information of the feature maps; then an efficient channel attention module is added before the average pooling layer so that the network attends more to the channel information between features. Experiments on the TAU-NIGENS Spatial Sound Events 2021 dataset show an overall improvement over the baseline model: the F-score (F1) and localization recall (LR) rise to 0.720 and 0.728, while the error rate (ER) and localization error (LE) fall to 0.393 and 11.71°. The improved model thus raises prediction accuracy without a significant increase in the parameter count or complexity of the network.
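As an illustration of the two attention mechanisms named above, the following PyTorch sketch shows a coordinate attention module inside a residual block and an efficient channel attention (ECA) module of the kind applied before the average pooling layer. The abstract does not give CECANet's exact configuration, so all class names, reduction ratios, kernel sizes, and module placements here are assumptions rather than the thesis implementation.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention: factorizes spatial attention into two 1-D
    encodings pooled along the height and width axes, so the attention
    map retains positional information in both directions."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                  # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).transpose(2, 3)  # (n, c, w, 1)
        y = torch.relu(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                  # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))  # (n, c, 1, w)
        return x * a_h * a_w                       # reweight spatial positions

class ECA(nn.Module):
    """Efficient channel attention: a small 1-D convolution across the
    globally pooled channel vector, with no dimensionality reduction."""
    def __init__(self, k_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, k_size, padding=k_size // 2, bias=False)

    def forward(self, x):
        y = x.mean(dim=(2, 3))                     # global average pool: (n, c)
        y = torch.sigmoid(self.conv(y.unsqueeze(1)).squeeze(1))
        return x * y[:, :, None, None]             # reweight channels

class CAResidualBlock(nn.Module):
    """Residual block with coordinate attention on the convolutional branch;
    the exact placement inside CECANet's residual module is assumed."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))
        self.ca = CoordinateAttention(channels)

    def forward(self, x):
        return torch.relu(x + self.ca(self.body(x)))
```

In a configuration like the one described above, the ECA module would be applied to the final feature map just before the average pooling layer, so that channel reweighting informs the pooled representation.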
(2) To improve the accuracy of deep learning based sound event localization and detection and to reduce model training time, this paper proposes a sound event localization and detection network model (DSTN-SELDnet) built on depthwise separable convolutions and temporal convolutional networks. The network uses multichannel log-linear spectrograms as feature inputs, replaces the original CNN module with a less computationally intensive depthwise separable convolution module, and replaces the RNN part with a temporal convolutional network that can be computed in parallel. Experiments on the Tampere University of Technology (TUT) Sound Events 2018 dataset show that the improved model reduces the parameter count of the network's encoder module and the training time per batch without degrading localization and detection accuracy, significantly speeding up network training.
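The two substitutions can be sketched as follows in PyTorch. The thesis does not specify channel counts, kernel sizes, or dilation schedules, so the values below (and the names DepthwiseSeparableConv and TCNBlock) are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) convolution
    followed by a 1x1 (pointwise) convolution. For a k x k kernel this needs
    roughly 1/k^2 of the parameters of a standard convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class TCNBlock(nn.Module):
    """One residual block of a temporal convolutional network: a dilated
    causal 1-D convolution whose receptive field grows with the dilation."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation  # pad left to stay causal
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        y = F.pad(x, (self.left_pad, 0))       # no padding on the right
        return F.relu(x + self.conv(y))        # residual connection

# Illustrative temporal stack: stacked dilations (1, 2, 4, ...) widen the
# context window, and the sequence length is preserved throughout.
tcn = nn.Sequential(TCNBlock(64, dilation=1),
                    TCNBlock(64, dilation=2),
                    TCNBlock(64, dilation=4))
out = tcn(torch.randn(8, 64, 100))             # -> shape (8, 64, 100)
```

Unlike an RNN, which must step through the sequence one frame at a time, the dilated convolutions process all time steps at once, which is the source of the per-batch training-speed gain reported above.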