
Acoustic Scene Classification With Mismatched Recording Devices

Posted on: 2021-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: S W Jiang
Full Text: PDF
GTID: 2428330623468331
Subject: Electronic and communication engineering
Abstract/Summary:
Sounds carry a large amount of environmental information, and recognizing this information in recorded audio makes it possible to determine the acoustic scene in which the audio was captured. Acoustic scene classification aims to recognize the acoustic scene where recorded audio is located. In recent years it has gradually become a new research field, and with the spread of this research and its application in wearable devices, the problem of mismatched recording devices can arise. This has motivated research on acoustic scene classification with mismatched recording devices.

The dataset used in this thesis is the DCASE (Detection and Classification of Acoustic Scenes and Events) 2019 Task 1B dataset. DCASE is an international challenge organized under the IEEE AASP (Audio and Acoustic Signal Processing) Technical Committee, and the datasets it provides are used by a large number of researchers in China and abroad. Using the feature extraction method and convolutional neural network of the baseline system provided by DCASE, the average accuracy across devices is 41.4%.

The log-mel spectrogram is then used for feature extraction (illustrative code sketches for the main processing steps follow this abstract). On top of this feature extraction, spectrogram decomposition methods, including HPSS (harmonic-percussive source separation), NNF, and vocal separation, as well as an HRTF (head-related transfer function) method, are used for audio preprocessing. The same VGGNet architecture is trained on the resulting features, and the trained models are ensembled, further improving the average accuracy across devices: it reaches 64.0% with the spectrogram decomposition methods and rises to 65.1% with the HRTF method.

On top of the audio preprocessing, spectrum correction is also applied while extracting features for all audio clips in the dataset; correcting the spectra of the clips reduces the differences between devices and makes their audio more similar. The corrected features are again trained with the VGGNet architecture. Compared with the single-model results without spectrum correction, the average accuracy across devices improves by up to 8.8% with spectrogram-decomposition preprocessing and by up to 5.9% with HRTF preprocessing.

Building on the audio preprocessing and spectrum correction, the VGGNet architecture is replaced with a ResNet architecture, Mixup is used for data augmentation of the extracted features, and focal loss is used as the training loss. Compared with the single-model VGGNet results, the average accuracy across devices improves substantially, by up to 10.3%, for every audio preprocessing algorithm except vocal separation. Finally, the trained models are ensembled with an additional class-weighting method, and the average accuracy across devices reaches its highest value of 73.9%, which is 32.5% higher than the baseline system.
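The following is a minimal sketch of the log-mel feature extraction with HPSS preprocessing described above, assuming librosa and the DCASE 2019 Task 1B audio format (44.1 kHz single-channel clips). The parameter values (n_fft, hop_length, n_mels) are illustrative defaults, not necessarily the thesis's exact settings.

```python
import librosa
import numpy as np

def logmel_hpss(path, sr=44100, n_fft=2048, hop_length=1024, n_mels=128):
    y, _ = librosa.load(path, sr=sr)
    # Harmonic-percussive source separation on the complex STFT.
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    harmonic, percussive = librosa.decompose.hpss(stft)
    feats = []
    for spec in (harmonic, percussive):
        # Power spectrogram -> mel scale -> log (dB) scale.
        mel = librosa.feature.melspectrogram(
            S=np.abs(spec) ** 2, sr=sr, n_mels=n_mels)
        feats.append(librosa.power_to_db(mel))
    # Stack harmonic and percussive channels: (2, n_mels, frames).
    return np.stack(feats)
```

The NNF and vocal-separation variants would follow the same pattern, replacing the HPSS step with the corresponding decomposition of the spectrogram.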
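A hedged sketch of spectrum correction across devices, assuming parallel recordings of the same segments from the reference device A and a target device (B or C), as provided in DCASE 2019 Task 1B: correction coefficients are taken as the ratio of average magnitude spectra and applied bin-wise to each clip's STFT. Function and variable names are illustrative, not the thesis's exact implementation.

```python
import librosa
import numpy as np

def correction_coefficients(ref_clips, dev_clips, n_fft=2048):
    """Ratio of average magnitude spectra: reference device / target device."""
    def mean_spectrum(clips):
        mags = [np.abs(librosa.stft(y, n_fft=n_fft)).mean(axis=1)
                for y in clips]
        return np.mean(mags, axis=0)
    return mean_spectrum(ref_clips) / (mean_spectrum(dev_clips) + 1e-10)

def apply_correction(y, coeffs, n_fft=2048, hop_length=1024):
    """Scale each frequency bin of the clip's STFT, keeping the phase."""
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    corrected = stft * coeffs[:, np.newaxis]
    return librosa.istft(corrected, hop_length=hop_length)
```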
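Mixup augmentation, as used for the ResNet training, mixes pairs of examples and their labels with a weight drawn from a Beta distribution. A minimal sketch following Zhang et al. (2018); the value of alpha here is illustrative.

```python
import numpy as np

def mixup(x, y, alpha=0.2):
    """x: (batch, ...) feature arrays; y: (batch, n_classes) one-hot labels."""
    lam = np.random.beta(alpha, alpha)
    perm = np.random.permutation(len(x))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed
```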
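Focal loss down-weights well-classified examples so that training focuses on hard ones. A hedged multi-class sketch in PyTorch following Lin et al. (2017); gamma=2.0 is the common default, not necessarily the thesis's setting.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """logits: (batch, n_classes); targets: (batch,) class indices."""
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, reduction="none")  # per-sample -log p_t
    p_t = torch.exp(-ce)                                   # probability of true class
    return ((1.0 - p_t) ** gamma * ce).mean()
```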
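For the final class-weighted ensemble, one plausible reading is that each model's softmax output is weighted per class (for example by its per-class validation accuracy) before averaging. The weighting scheme below is an assumption for illustration, not the thesis's exact method.

```python
import numpy as np

def class_weighted_ensemble(probs, weights):
    """probs: (n_models, batch, n_classes) softmax outputs;
    weights: (n_models, n_classes) per-model, per-class weights
    (assumed here to come from per-class validation accuracy)."""
    weighted = probs * weights[:, np.newaxis, :]              # scale each class
    combined = weighted.sum(axis=0) / weights.sum(axis=0)     # normalized average
    return combined.argmax(axis=-1)                           # predicted classes
```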
Keywords/Search Tags: Acoustic scene classification, log-mel spectrogram, VGGNet, spectrum correction, ResNet