
Automatic Audio Tagging Method In Complex Scene

Posted on: 2019-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: L S Zhang
Full Text: PDF
GTID: 2428330566996741
Subject: Computer Science and Technology
Abstract/Summary:
At present, China's artificial intelligence technology has entered a phase of rapid development. Audio and speech, as important interfaces through which smart devices interact with the outside world and with humans, have received extensive attention from government, industry, and academia. The State Council has issued an artificial intelligence development plan stating that China will apply artificial intelligence widely in areas such as education, medical care, elderly care, environmental protection, and urban construction; audio tagging technology can provide effective services in all of these areas. However, the acoustic information in such practical application scenes is usually very complex, often containing multiple sound sources and varying levels of noise. Automatic audio tagging methods for complex scenes are therefore in urgent demand to supply intelligent systems with information from the sound modality. At present, however, there is neither a standard, effective processing workflow nor an effective model for audio data in complex scenes. This thesis therefore focuses on the automatic audio tagging task in complex scenes.

Based on an analysis of the nature of audio data in complex scenes, we propose a workflow for processing such data. A detailed analysis of the time-domain and frequency-domain information reveals that the information characterizing an audio category resides in both domains; that the frequency-domain patterns are uncertain and non-uniformly distributed (for example, some audio files contain large amounts of silence); and, from the distribution of categories and file durations in the dataset, that the audio categories are imbalanced. Based on these three conclusions, we design an audio processing workflow that includes audio activity detection and noise removal, together with data augmentation and oversampling strategies for the dataset. Experiments show that this data processing workflow significantly improves the performance of audio tagging tasks in complex scenes.

To address the non-uniform distribution of information in the time domain and the variable frequency-domain patterns of the categories, we propose the self-attention Inception LDNN model, a deep learning model that combines an attention mechanism with multi-size convolutional layers. Experiments evaluating the two mechanisms individually, as well as the full model, verify that both mechanisms are effective and that the self-attention Inception LDNN achieves state-of-the-art performance for automatic audio tagging in complex scenes.
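Two of the preprocessing steps described above can be illustrated with a minimal sketch: energy-based audio activity detection (to identify and drop silent clips) and random oversampling (to rebalance minority classes). This is a hypothetical illustration, not the thesis's implementation; the frame length, energy threshold, and function names are assumptions chosen for clarity.

```python
import numpy as np

def detect_activity(signal, frame_len=256, energy_thresh=1e-3):
    """Return a boolean mask marking frames whose mean energy exceeds a threshold.

    Frames below the threshold can be treated as silence and removed.
    """
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > energy_thresh

def oversample(labels, rng):
    """Randomly repeat indices of minority classes until every class
    matches the majority-class count (random oversampling)."""
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.where(labels == c)[0]
        # keep all original samples, then draw extras with replacement
        idx.append(np.concatenate([members, rng.choice(members, target - n)]))
    return np.concatenate(idx)
```

In practice the activity mask would be applied per audio file before feature extraction, and oversampling would operate on the training split only, so that evaluation data keeps its natural class distribution.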
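The two mechanisms combined in the model above can be sketched in plain numpy: Inception-style convolutions of several kernel sizes over the time axis of a feature matrix, followed by self-attention pooling over time. All layer sizes, the "same"-padding choice, and the softmax attention form here are illustrative assumptions, not the exact self-attention Inception LDNN architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv1d_same(x, w):
    """1-D 'same' convolution along the time axis.
    x: (time, in_features); w: (kernel, in_features, out_features)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))
    return np.stack([np.tensordot(xp[i:i + k], w, axes=([0, 1], [0, 1]))
                     for i in range(x.shape[0])])

def attention_pool(h, v):
    """Self-attention pooling: weight each time step by softmax(h @ v),
    then sum over time to get a fixed-length clip embedding."""
    a = softmax(h @ v, axis=0)            # (time,) attention weights
    return (h * a[:, None]).sum(axis=0)   # (features,)
```

A forward pass would concatenate `conv1d_same` outputs for several kernel sizes (the multi-size branch) on the feature axis, then apply `attention_pool` so that informative time steps dominate the clip-level representation used for tagging.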
Keywords/Search Tags: audio tagging, audio in complex scenes, audio signal processing, deep neural network, artificial intelligence