Acoustic Scene Recognition Based On Sparse Representation And Deep Neural Networks

Posted on:2021-01-19
Degree:Master
Type:Thesis
Country:China
Candidate:C Lin
Full Text:PDF
GTID:2428330602966204
Subject:Circuits and Systems
Abstract/Summary:
Audio scene recognition identifies a scene by analyzing its environmental sound. It has broad application prospects in areas such as multimedia retrieval, smart homes, intelligent robots, security monitoring, and intelligent terminals, and therefore has significant research value. This thesis studies audio scene recognition based on sparse representation and deep neural networks, and proposes four fusion methods for audio scene recognition, together with a multi-channel fusion step applied to each of them.

1) A front-end feature fusion method is proposed. For each audio frame in an audio segment, the method combines the score feature obtained from sparse representation with the log-mel feature to form a new feature. The audio segment, represented by the fused feature, is then fed to a DCNN for audio scene recognition. The sparse-representation score feature reflects the distribution of scene classes from the perspective of the audio basis space, while the log-mel feature reflects the acoustic characteristics of the audio. Because the two feature types capture audio information from different perspectives and complement each other, the fused feature is more informative than either feature alone.

2) A back-end feature fusion method is proposed. The score feature obtained from sparse representation and the log-mel feature are fed separately into DCNNs, deep features are extracted from each network, and the two types of deep features are then combined for audio scene recognition with a DCNN. The back-end method performs better than the methods before fusion and, on the whole, better than the front-end feature fusion method.

3) A decision value fusion method based on sparse representation and the log-mel spectrum is proposed. The score feature obtained from sparse representation and the log-mel spectrum feature are fed separately into DCNNs, the decision values output by the networks are fused by element-wise multiplication, and the fused decision values are used for classification. This method uses DCNNs to extract classification information from both the sparse representation feature and the log-mel feature. The fused decision values combine the classification strengths of both feature sets and therefore classify better than the decision values before fusion.

4) A decision value fusion method based on sparse deep features is proposed. The deep features extracted by a VGG16 network and an LSTM network first undergo sparse processing. The VGG16 and LSTM networks then each produce decision values from the sparse features, the decision values are fused by element-wise multiplication, and the fused decision values are used for classification. This method integrates the classification ability of the VGG16 and LSTM networks and also exploits the advantages of sparse coding, so it achieves better classification performance than either the VGG16 network or the LSTM network alone.

5) Multi-channel fusion is applied to each of the proposed fusion methods. For stereo audio data, each method first obtains decision values from the left, right, and mono channels separately, and then fuses the per-channel decision values by element-wise multiplication. Because this makes full use of the information provided by every channel, it improves classification performance effectively: multi-channel fusion achieves better classification results than the left, right, or mono channel alone.
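The element-wise multiplication of decision values used in methods 3) to 5) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `fuse_decisions` and `predict`, the three-class example values, and the renormalisation step are all assumptions made for illustration, and the decision values are assumed to be non-negative per-class scores such as softmax outputs.

```python
from math import prod


def fuse_decisions(*decision_vectors):
    """Fuse per-class decision values by element-wise multiplication.

    Each argument is one list of non-negative per-class scores
    (hypothetical stand-ins for the outputs of one network or one channel).
    """
    if not decision_vectors:
        raise ValueError("need at least one decision vector")
    n_classes = len(decision_vectors[0])
    if any(len(v) != n_classes for v in decision_vectors):
        raise ValueError("all decision vectors must have the same length")

    # Multiply the scores of every branch class by class.
    fused = [prod(v[k] for v in decision_vectors) for k in range(n_classes)]

    # Renormalising to sum to 1 is an assumption; it does not change the argmax.
    total = sum(fused)
    return [f / total for f in fused] if total > 0 else fused


def predict(decision_values):
    """Return the index of the class with the largest decision value."""
    return max(range(len(decision_values)), key=decision_values.__getitem__)


# Hypothetical decision values for three scene classes from two branches,
# e.g. a sparse-representation branch and a log-mel branch.
sparse_branch = [0.50, 0.30, 0.20]
logmel_branch = [0.40, 0.45, 0.15]
fused = fuse_decisions(sparse_branch, logmel_branch)

# Hypothetical per-channel decision values for multi-channel fusion.
left = [0.5, 0.4, 0.1]
right = [0.3, 0.6, 0.1]
mono = [0.4, 0.5, 0.1]
fused_channels = fuse_decisions(left, right, mono)

print(predict(fused), predict(fused_channels))
```

A product rule of this kind rewards classes on which all branches agree: a class that any single branch scores near zero is suppressed in the fused result, whichever branch or channel it came from.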
Keywords/Search Tags:Sparse representation, Deep neural network, Score calibration, Fusion strategies, Audio scene recognition