
Research On Deep Neural Network Model Of Audio Tagging

Posted on: 2021-10-21
Degree: Master
Type: Thesis
Country: China
Candidate: L Cui
GTID: 2568306104464074
Subject: Engineering

Abstract:
In recent years, with the success of deep learning in speech recognition and image processing, audio tagging has also attracted increasing attention. With the spread of intelligent mobile devices, large numbers of users upload recordings to the network every day, so automatically labeling audio has become an important research direction. Traditional hand-crafted features and shallow-structured classifiers require a great deal of work, and they cannot make good use of the latent relationships between contextual information and the different sound event classes. To address these problems, this thesis applies deep neural network methods to audio tagging and investigates their impact on accuracy and performance.

Firstly, a learnable context gate helps select the features most relevant to the final audio event class, while an attention mechanism helps the model focus on the audio frames most relevant to that class. Context gating and attention are therefore introduced into a convolutional recurrent neural network to form the attention-gated convolutional recurrent neural network (AT-GCRNN). AT-GCRNN is applied to general audio tagging and compared with a convolutional neural network (CNN) and a convolutional recurrent neural network (CRNN); the experimental results show that AT-GCRNN achieves higher tagging accuracy than both.

Secondly, a time-frequency segmentation mask network can separate sound events from the background scene in the time-frequency domain and enhance the sound events in an audio clip. Compared with a plain CNN, MobileNetV2 reduces the number of network parameters, and Res2Net enlarges the receptive field of each network layer. The improved time-frequency segmentation network is therefore used to model urban sound tagging and is compared against VGGNet and a CNN; the results show that the improved model is both faster and more accurate than the other networks.

Finally, a deep neural network framework combining atrous convolution with Res2NeXt is constructed and applied to urban sound tagging, and is compared with the VGGNet network and the modified MobileNetV2 model. Atrous convolution captures multi-scale context information, and Res2NeXt improves on Res2Net, raising classification accuracy while reducing the number of hyperparameters. The results show that the classification performance of this model is better than that of the other two networks.
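As a loose illustration of the context-gating and attention ideas summarized above (a minimal sketch under assumed shapes, not the thesis's actual AT-GCRNN architecture; the function names, dimensions, and random inputs are all invented for this example), a gate produces sigmoid weights from the input features and multiplies them element-wise, while attention pooling weights each frame by a softmax over per-frame relevance scores:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gate(features, W, b):
    """Element-wise gating: y = features * sigmoid(W @ features + b).

    The sigmoid output in (0, 1) acts as a learned per-feature weight,
    suppressing features irrelevant to the target event class.
    """
    gate = sigmoid(W @ features + b)
    return features * gate

def attention_pool(frame_scores, frame_values):
    """Attention pooling over time: softmax-weight each frame's value
    vector by its relevance score, so the frames most related to the
    event class dominate the clip-level representation."""
    weights = np.exp(frame_scores - frame_scores.max())
    weights /= weights.sum()
    return (weights[:, None] * frame_values).sum(axis=0)

# Toy usage with random numbers (illustrative only).
rng = np.random.default_rng(0)
feats = rng.normal(size=8)
W, b = rng.normal(size=(8, 8)), np.zeros(8)
gated = context_gate(feats, W, b)                    # same shape as feats
pooled = attention_pool(rng.normal(size=5),          # 5 frame scores
                        rng.normal(size=(5, 8)))     # 5 frames x 8 features
```

In a full model these operations would sit on top of CNN/RNN feature maps; here they act on plain vectors only to show the gating and pooling arithmetic.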
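The atrous (dilated) convolution mentioned in the final paragraph enlarges the receptive field without adding parameters by sampling the input with gaps between kernel taps. A minimal 1-D sketch (the kernel values and dilation rates below are made up for illustration):

```python
import numpy as np

def atrous_conv1d(signal, kernel, dilation):
    """1-D atrous (dilated) convolution with 'valid' padding.

    With dilation d, a kernel of length k spans d*(k-1)+1 input samples,
    so larger dilations see wider context at no extra parameter cost.
    """
    k = len(kernel)
    span = dilation * (k - 1) + 1
    out_len = len(signal) - span + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        # Sample the input every `dilation` steps under the kernel.
        out[i] = sum(kernel[j] * signal[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)        # toy input signal
k = np.array([1.0, 1.0, 1.0])         # toy 3-tap kernel
y1 = atrous_conv1d(x, k, dilation=1)  # spans 3 samples (ordinary conv)
y2 = atrous_conv1d(x, k, dilation=2)  # spans 5 samples: wider context
```

Stacking layers with increasing dilation rates is how such networks capture multi-scale context, which is the property the thesis exploits for urban sound tagging.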
Keywords: Audio tagging, Deep learning, Convolutional recurrent neural network, Time-frequency segmentation network, Atrous convolution