Environmental Sound Recognition Based On Deep Learning

Posted on: 2021-02-08 | Degree: Master | Type: Thesis
Country: China | Candidate: Z C Zhang
GTID: 2428330614956796 | Subject: Signal and Information Processing
Abstract/Summary:
Environmental sound recognition (also called environmental sound classification, ESC) is an important problem in audio information research: a computer models the human auditory response to analyze a short-term audio signal, recognize it, and predict a predefined category label. Environmental sounds convey a great deal of information that helps people monitor environmental conditions and analyze acoustic scenes. Environmental sound recognition is now widely used in security monitoring, medical monitoring, machine intelligence, and human-computer interaction.

Environmental sound recognition remains challenging. On the one hand, unlike speech and music, environmental sound has complex spectral characteristics and temporal structure. Spectrally, environmental sounds may be tonal, exhibiting distinct peaks in the spectrum (e.g. a siren), or noise-like, with spectral power spanning a broad frequency band (e.g. wind). Temporally, they may be transient, intermittent, or continuous. Designing an effective recognition system around these characteristics is therefore an important and difficult problem. On the other hand, training data is scarce, so a second problem is how to ensure that a model generalizes well when only limited data is available. To address these problems, this thesis studies environmental sound recognition from the following three aspects.

First, we investigate environmental sound recognition methods based on convolutional recurrent neural networks. The system takes the audio spectrogram as input; the spectrogram depicts the energy distribution of the sound signal, which a convolutional neural network can learn. The convolution kernels can also learn detailed local information from the spectrogram, which has been shown to be an important trait for distinguishing between sound classes. Moreover, environmental sound is essentially sequence data that contains correlation information between adjacent frames. To compensate for the limited ability of convolutional neural networks to learn sequential relationships, a recurrent neural network is applied to model the sequential dynamics of environmental sound signals. Experimental results show that the convolutional recurrent neural network achieves better recognition performance than several typical deep learning models and traditional classification models.

Second, we study data augmentation methods for environmental sound recognition and propose an online data augmentation scheme for the task. Publicly available environmental sound datasets are currently relatively small, and the data distributions of the training and test sets differ considerably, so it is difficult to obtain good generalization from limited training data. This thesis first reviews existing data augmentation methods and then proposes an online augmentation scheme that builds on them. The proposed scheme processes the input spectrogram directly during the training phase, which ensures the diversity of training samples, requires neither additional data nor extra computation cost, and is flexible. It substantially improves recognition performance on several public datasets.

Finally, we propose an attention-based environmental sound recognition model. A main difficulty in developing ESC systems is the complex and variable temporal and spectral characteristics of the signals. We propose an attention mechanism that enables the deep neural network to focus automatically on semantically relevant features and discard irrelevant information such as noise or background segments. Specifically, to deal with complex temporal structure, we propose a temporal attention mechanism that lets the recognition model give larger weights to semantically relevant frames and smaller weights to noise frames. To deal with variable spectral characteristics, we propose a channel attention mechanism that filters out unrelated feature maps in the convolutional layers by exploiting the pattern-detecting ability of convolutional kernels. We further combine the learning characteristics of the temporal and channel attention mechanisms to design a joint attention mechanism with stronger learning ability. In experiments, we visualize the learned attention; the results show that the proposed attention model makes the network focus automatically on semantically relevant features and improves recognition performance.

The proposed methods are evaluated on several environmental sound recognition benchmark datasets (ESC-10, ESC-50, and DCASE2016), and the experimental results show that they are effective for environmental sound recognition tasks.
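
The temporal and channel attention ideas described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract gives no formulas, so the sketch assumes a common formulation in which temporal attention is a softmax over per-frame scores and channel attention is a sigmoid gate computed from each feature map's mean activation. The function names and the parameters w, channel_w, and channel_b are hypothetical stand-ins for learned weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(feature_maps, channel_w, channel_b):
    """Gate each feature map by a scalar in (0, 1).

    feature_maps: C lists of T activations (frequency axis assumed already pooled).
    channel_w, channel_b: hypothetical per-channel parameters (assumption).
    """
    gates = []
    for c, fmap in enumerate(feature_maps):
        mean_act = sum(fmap) / len(fmap)  # "squeeze" the map to one number
        logit = channel_w[c] * mean_act + channel_b[c]
        gates.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid gate
    gated = [[g * v for v in fmap] for g, fmap in zip(gates, feature_maps)]
    return gated, gates


def temporal_attention(frames, w):
    """Pool T frame vectors into one clip vector with softmax weights.

    frames: T lists of D features; w: hypothetical scoring vector of length D.
    """
    scores = [sum(x * wi for x, wi in zip(frame, w)) for frame in frames]
    weights = softmax(scores)  # one weight per frame, summing to 1
    dim = len(frames[0])
    pooled = [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]
    return pooled, weights


# Toy input: 3 channels x 4 frames; frames 1 and 2 carry the "event".
feature_maps = [
    [0.1, 2.0, 1.8, 0.2],
    [0.0, 0.1, 0.0, 0.1],  # a mostly inactive channel
    [0.5, 1.5, 1.2, 0.4],
]
gated, gates = channel_attention(feature_maps, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])

# Transpose to one feature vector per frame, then pool over time.
frames = [list(column) for column in zip(*gated)]
clip_vector, frame_weights = temporal_attention(frames, [1.0, 0.0, 1.0])
```

In this toy input the inactive channel receives a smaller gate than the active ones, and the frames containing the event receive most of the temporal weight. Applying channel gating before temporal pooling is one simple way to combine the two mechanisms; the joint mechanism in the thesis may differ.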
Keywords/Search Tags: Environmental Sound Recognition, Deep Learning, Convolutional Recurrent Neural Network, Attention Mechanism, Data Augmentation