
Audio Scene Classification Based On Convolutional Networks

Posted on: 2018-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: L Lu
GTID: 2428330515489733
Subject: Pattern Recognition and Intelligent Systems

Abstract/Summary:
Acoustic scene classification (ASC) refers to the task of associating a semantic label with an audio stream, identifying the environment in which it was produced. It has attracted much interest from researchers due to its promising applications: smartphones, wearable devices, and robots equipped with artificial intelligence may all benefit from ASC by providing different services and applications according to the context inferred from audio. Convolutional Neural Networks (CNNs) have achieved outstanding performance in acoustic scene classification. Features extracted in the traditional way are deficient in robustness and representational power, a problem that CNNs can alleviate. In this thesis, we study acoustic scene classification with CNNs that learn directly from the spectrogram. The spectrogram preserves information in both the frequency and time domains, which helps improve classification accuracy.

First, we propose a shallow CNN architecture for acoustic scene classification that strikes a balance between accuracy and computational complexity. For real-world applications, a deeper architecture means more computation time and memory. The horizontal and vertical strips that appear in the (Mel-)spectrogram of an acoustic scene are regular and simple. Based on these characteristics, we progressively modify a baseline model, preserving its classification accuracy while decreasing its computational complexity. We present a shallow architecture with only 12% of the complexity of the baseline deep model. We conduct experiments on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 dataset. Four-fold cross-validation on the dataset yields a classification accuracy of 80.48%, a nearly 6% improvement over the deep architecture.

Second, we propose a new way to extract features based on the standard deviation of each acoustic scene, and fuse these features with the Mel-spectrogram to augment the input of the CNN. The Mel-spectrogram is a spectrogram image feature based on the human auditory system, but the frequency components of acoustic scenes differ from human perception. We propose a data-driven way to downsample the spectrogram by analyzing the standard deviation of each acoustic scene in the training set. We use this downsampling strategy to extract features and combine them with the Mel-spectrogram as the input of the CNN. Results show that feature fusion further improves the accuracy by 1.4%, reaching 81.88%.

Third, we use a hierarchical learning method that incorporates the hierarchical taxonomy of acoustic scenes. The parameters of the CNN are initialized by the proposed hierarchical pre-training, which builds on the feature fusion above. The 15 classes are first divided into 3 high-level categories, and these hierarchical labels are used to train two CNNs. CNN1 is trained to predict the three high-level acoustic scene categories, namely indoor, outdoor, and vehicle. CNN2 is then trained to estimate the posterior over the 15 low-level acoustic scene classes, with its weights initialized from CNN1. Results show that pre-training further improves the accuracy by 0.55%, and the final hierarchical learning method achieves a classification accuracy of 82.43%.
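The Mel-spectrogram front end mentioned above warps linear frequency onto the perceptual Mel scale before pooling the spectrum into bands. The thesis gives no code, so the following is only a minimal sketch of the standard HTK-style conversion formulas; the 40-band, 22050 Hz setup is a hypothetical example, not a configuration stated in the abstract.

```python
import math

def hz_to_mel(f_hz):
    """Standard HTK-style Hz -> Mel conversion: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse conversion: Hz = 700 * (10^(mel/2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Band edges for a hypothetical 40-band Mel filterbank covering 0-22050 Hz:
# equally spaced in Mel, hence increasingly wide in Hz toward high frequencies.
n_bands = 40
mel_max = hz_to_mel(22050.0)
edges_hz = [mel_to_hz(mel_max * i / (n_bands + 1)) for i in range(n_bands + 2)]
```

Because the spacing is uniform in Mel, low frequencies get narrow bands and high frequencies get wide ones, which is exactly the mismatch with scene acoustics that motivates the data-driven alternative in the second contribution.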
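The data-driven downsampling described above selects frequency regions by their variability across the training set. The abstract does not specify the procedure, so the sketch below is one plausible reading under assumed details: compute a per-bin standard deviation over all training frames, then keep only the most variable bins of each spectrogram (all names here are hypothetical).

```python
import math

def per_bin_std(spectrograms):
    """Standard deviation of each frequency bin, pooled over all frames of
    all training spectrograms (each spectrogram: list of rows, one per bin)."""
    n_bins = len(spectrograms[0])
    stds = []
    for b in range(n_bins):
        vals = [v for spec in spectrograms for v in spec[b]]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stds.append(math.sqrt(var))
    return stds

def downsample_by_std(spec, stds, keep):
    """Keep only the `keep` frequency bins with the largest training-set std,
    preserving their original low-to-high frequency order."""
    order = sorted(range(len(stds)), key=lambda b: stds[b], reverse=True)
    selected = sorted(order[:keep])
    return [spec[b] for b in selected]
```

The reduced representation could then be stacked with the Mel-spectrogram as extra input channels of the CNN, which is one way to realize the feature fusion the abstract describes.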
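The hierarchical pre-training groups the 15 scene classes under the three coarse categories (indoor, outdoor, vehicle) so that CNN1 can be trained on coarse labels before CNN2 is fine-tuned on the full label set. The grouping below uses the standard DCASE 2017 Task 1 class list, but the assignment of classes to categories is an assumption of this sketch, not a mapping given in the abstract.

```python
# Assumed coarse taxonomy over the 15 DCASE 2017 Task 1 scene classes.
COARSE = {
    "indoor":  ["cafe/restaurant", "grocery_store", "home", "library",
                "metro_station", "office"],
    "outdoor": ["beach", "city_center", "forest_path", "park",
                "residential_area"],
    "vehicle": ["bus", "car", "train", "tram"],
}

# Invert the taxonomy: fine label -> coarse label.
FINE_TO_COARSE = {fine: coarse
                  for coarse, fines in COARSE.items()
                  for fine in fines}

def coarse_label(fine):
    """Map a fine scene label to its coarse category (training target for CNN1)."""
    return FINE_TO_COARSE[fine]
```

During pre-training, every clip keeps its audio but is relabeled with `coarse_label(...)`; CNN2 then reuses CNN1's weights as initialization for the 15-way task.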
Keywords/Search Tags:acoustic scene classification, convolutional networks, low complexity, hierarchical pre-training