Research On Environmental Sound Classification Method Based On Deep Learning

Posted on: 2020-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y Chen
Full Text: PDF
GTID: 2428330578973735
Subject: Computer application technology
Abstract/Summary:
In the field of Sound Information Retrieval (SIR), environmental sound classification (ESC) has emerged as a hot issue. It aims to identify the complex features extracted from various sound signals and the corresponding semantic tags contained in the scene, so that the surrounding environment can be perceived, understood, and classified. The most commonly used audio feature extraction method is the Mel-frequency cepstral coefficient (MFCC). Although MFCC has strong anti-interference ability and can capture the most recognizable part of the sound data, it operates only on the short-term characteristics of the signal and fails to describe the structural characteristics of the sound data as a whole.

In recent years, deep learning has matured into one of the most effective feature extraction methods, achieving breakthroughs in fields such as machine learning, image recognition, and natural language processing. The convolutional neural network (CNN) is a classical deep learning framework, and CNNs with pooling layers have been applied to the classification of urban sound sources. However, pooling operations often discard a large amount of information, which affects the accuracy of the classification results. Building on the fine-grained structural analysis capability of CNNs, this paper analyzes the structural features produced by the traditional MFCC front end and explores a better deep learning method for the traditional audio scene classification problem. First, starting from the classical CNN model and adopting dilated convolution, we find that dilated convolution enlarges the receptive field without increasing the number of parameters, since its "gridding" structure inserts gaps between kernel taps; it therefore covers more frames and is a good replacement for traditional convolution combined with pooling. At the same time, further study of the dilated convolution structure shows that enlarging the dilation rate or the number of dilated convolution layers reduces the experimental accuracy. We believe this is caused by the inherent "gridding" defect of the dilated convolution model, which ignores a large amount of information, and by the excessive receptive field, which makes each frame too large to capture how the sound signal changes over time. It is foreseeable that many questions in deep-learning-based audio scene classification remain worthy of further exploration.

The main research contents and achievements of this paper are as follows:

(1) The audio signal processing problem is introduced and deep learning techniques are summarized. We find that traditional audio signal processing can only analyze the short-term characteristics of signals, and that the subsequent steps, which rely mainly on general shallow classifiers, are complicated. By studying typical deep learning methods, we identify a structured model suitable for sound recognition and classification in practice.

(2) Deep learning offers a variety of architectures; different structures have different sensitivity to different scene features, and their recognition performance varies accordingly. This paper studies audio scene feature extraction and classification with a traditional convolutional neural network with pooling, and then introduces the idea of dilated convolution into the model design. This special convolution structure yields better results on the urban sound source dataset than the traditional convolutional neural network does.

(3) An in-depth study of the influence of the dilated convolution structure on the experimental results shows that enlarging the dilation rate or the number of dilated convolution layers reduces classification accuracy. We attribute this outcome to a mismatch: the audio signal has short-term stationarity, while the dilation model has a "gridding" connection defect. After the MFCC-processed features pass through the grid structure, the range of frames covered changes severely, which ultimately harms feature extraction for the overall audio signal.
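The MFCC front end discussed above can be sketched in a few lines. This is a minimal illustrative pipeline (frame, window, power spectrum, mel filterbank, log, DCT), not the thesis's exact configuration; the sample rate, frame length, hop size, and filter counts below are assumed defaults.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch; parameter values are illustrative, not the thesis's."""
    # Frame the signal and apply a Hann window, then take the power spectrum.
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    # Triangular mel filterbank spaced evenly on the mel scale up to Nyquist.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(spec @ fbank.T + 1e-10)
    # Type-II DCT; keep the first n_ceps cepstral coefficients per frame.
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[:, None] + 0.5) * np.arange(n_ceps)[None, :])
    return logmel @ dct

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
print(mfcc(x).shape)  # one 13-coefficient row per 10 ms frame
```

Each output row describes one short-term frame in isolation, which is precisely why MFCC alone cannot capture the long-range structure of a whole recording.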
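The claim that dilated convolution enlarges the receptive field without adding parameters can be checked with a short calculation. The helper below is illustrative (stride-1 convolutions assumed), not code from the thesis: each 3-tap layer adds 2·d frames of coverage while its parameter count stays fixed.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of a stack of stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens coverage by (k-1)*d
    return rf

# Three ordinary 3-tap layers: coverage grows linearly.
print(receptive_field(3, [1, 1, 1]))  # 7
# Exponentially increasing dilation rates: coverage grows geometrically
# with exactly the same number of parameters per layer.
print(receptive_field(3, [1, 2, 4]))  # 15
```

This is why dilated convolution can replace the convolution-plus-pooling design: it covers more frames without discarding activations the way pooling does.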
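The "gridding" defect blamed for the accuracy drop can likewise be made concrete. The sketch below (an illustrative calculation, not the thesis's model) enumerates which input offsets a stack of dilated convolutions actually reads when producing one output: with a repeated large dilation rate, some frames are never touched at all.

```python
def touched_inputs(kernel_size, dilations):
    """Input offsets (relative to one output position) read by a stack
    of stride-1 dilated convolutions with the given dilation rates."""
    offsets = {0}
    half = kernel_size // 2
    for d in dilations:
        offsets = {o + d * t for o in offsets for t in range(-half, half + 1)}
    return sorted(offsets)

# Mixed rates cover every offset inside the receptive field.
print(touched_inputs(3, [1, 2]))  # [-3, -2, -1, 0, 1, 2, 3]
# A repeated rate of 2 skips every odd frame: the "gridding" holes.
print(touched_inputs(3, [2, 2]))  # [-4, -2, 0, 2, 4]
```

The skipped frames are exactly the lost information the abstract describes: for a short-term-stationary audio signal, dropping alternate MFCC frames severely distorts the coverage of the feature sequence.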
Keywords/Search Tags:Deep learning, Environmental sound classification, Convolutional neural network, Dilated convolution