
Research On Deep Learning For Sound Event Detection

Posted on: 2023-08-29    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y B Wang    Full Text: PDF
GTID: 1528306911481134    Subject: Electronic Science and Technology
Abstract/Summary:
Sound Event Detection (SED) aims to identify the sound events in a recording and to detect their onset and offset times, and it is a research hotspot in multimedia signal processing. Audio signals are accompanied by various environmental noises, and different kinds of sound events often occur simultaneously, with large intra-class differences and small inter-class differences, which makes sound event detection challenging. Recently, with the development of deep learning, the accuracy of SED methods has improved greatly, but several shortcomings remain: limited temporal dependencies, uncontrollable temporal dependencies, difficult training, low efficiency, and class imbalance. This thesis analyzes these shortcomings and proposes corresponding solutions. The main research contents are as follows:

1. In existing sound event detection systems, Recurrent Neural Networks (RNNs), such as long short-term memory units and gated recurrent units, are used to capture temporal dependencies, but the length of the dependencies they can model is limited, so they fail to model sound events of long duration. Moreover, RNNs cannot process data in parallel, which leads to low efficiency and limits their industrial value. Given these shortcomings, this thesis replaces RNNs with convolution for capturing temporal dependencies. After analyzing the advantages and disadvantages of existing convolutional networks, the thesis models temporal dependencies with dilated convolution and, on this basis, proposes Single-Scale Fully Convolutional Networks (SS-FCN) and Multi-Scale Fully Convolutional Networks (MS-FCN). With SS-FCN, the thesis studies how the length of temporal dependencies affects detection performance and observes that SS-FCN, which models a single dependency length, achieves good performance only for a limited set of event types. With MS-FCN, the thesis verifies that fusing features with different dependency lengths yields higher detection performance across diverse sound events.

2. Because MS-FCN ignores neighboring information and fine-grained dependencies, this thesis proposes a dilated mixed convolution module that fuses standard convolution and dilated convolution: the former captures neighboring information and fine-grained dependencies, the latter captures long-term dependencies, and together they turn point sampling into region sampling. Because MS-FCN also neglects intermediate-length temporal dependencies, the thesis proposes a Dilated Temporal Pyramid Pooling (DTPP) module, which captures the dependencies ignored by the cascaded dilated convolution module. By fusing the features of DTPP and the cascaded module, the thesis builds a cascaded parallel module that captures richer long- and short-term dependencies.
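To make the dilated-convolution idea concrete, the following is a minimal PyTorch sketch of a multi-scale dilated temporal convolution head over frame-level features; the module names, channel widths, and dilation rates are illustrative assumptions and not the thesis' actual SS-FCN/MS-FCN implementation.

```python
# Minimal sketch: multi-scale dilated 1-D convolution along the time axis.
# All names, channel sizes, and dilation rates are illustrative assumptions.
import torch
import torch.nn as nn

class DilatedTemporalBlock(nn.Module):
    """1-D convolution over frames; dilation widens the receptive field
    without pooling, so onset/offset resolution is preserved."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):          # x: (batch, channels, frames)
        return self.act(self.norm(self.conv(x)))

class MultiScaleTemporalHead(nn.Module):
    """Fuses branches with different temporal receptive fields, in the spirit
    of combining short- and long-range dependencies before classification."""
    def __init__(self, channels, num_classes, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            DilatedTemporalBlock(channels, d) for d in dilations)
        self.classifier = nn.Conv1d(channels * len(dilations), num_classes, 1)

    def forward(self, x):          # x: (batch, channels, frames)
        fused = torch.cat([b(x) for b in self.branches], dim=1)
        return torch.sigmoid(self.classifier(fused))  # frame-level event probabilities

# Example: 64-dim frame features, 500 frames, 10 event classes.
probs = MultiScaleTemporalHead(64, 10)(torch.randn(2, 64, 500))
print(probs.shape)  # torch.Size([2, 10, 500])
```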
3. To address class imbalance and the low utilization of effective samples, this thesis introduces the Support Vector Machine (SVM) and combines the soft-margin SVM with the cross-entropy loss (CE-loss) to obtain a soft-margin CE-loss. Like an SVM, the soft-margin CE-loss adaptively selects support vectors from the dataset to guide network training. To exploit the dataset more fully, the thesis further proposes a hybrid CE-loss that combines the advantages of the soft-margin CE-loss and the plain CE-loss. To remedy the limited ability of dilated convolution to capture long- and short-term dependencies, the thesis proposes an HDC-Inception module, which alleviates the "gridding issue" of dilated convolution and models long- and short-term context information. The proposed methods achieve competitive performance, and since all of them are fully convolutional, they can perform sound event detection quickly and precisely, giving them high industrial value.
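As a rough illustration of how a hinge-style soft-margin term can be blended with cross entropy for frame-level, multi-label SED, the sketch below combines the two; the margin value, the blending weight, and the function name are assumptions for illustration, not the thesis' exact soft-margin CE-loss or hybrid CE-loss formulation.

```python
# Sketch: hinge-style "soft margin" term combined with binary cross entropy.
# Margin and blending weight are illustrative assumptions.
import torch
import torch.nn.functional as F

def soft_margin_ce_loss(logits, targets, margin=1.0, alpha=0.5):
    """logits, targets: (batch, classes, frames); targets in {0, 1}.

    The hinge term is non-zero only for frames that violate the margin, so,
    like support vectors in an SVM, hard examples dominate its gradient;
    the cross-entropy term keeps every sample contributing."""
    signs = 2.0 * targets - 1.0                                 # map {0,1} -> {-1,+1}
    hinge = F.relu(margin - signs * logits).mean()              # soft-margin (hinge) part
    bce = F.binary_cross_entropy_with_logits(logits, targets)   # CE part
    return alpha * hinge + (1.0 - alpha) * bce

# Example: 10 event classes over 500 frames.
logits = torch.randn(2, 10, 500, requires_grad=True)
targets = torch.randint(0, 2, (2, 10, 500)).float()
loss = soft_margin_ce_loss(logits, targets)
loss.backward()
```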
Keywords/Search Tags: Sound event detection, Temporal context information, Fully convolutional network, Multi-scale information, Dilated Temporal Pyramid Pooling, Support vector machine