
Research On Temporal Relation-based Audio Semantic Representation Learning

Posted on: 2022-04-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L W Zhang    Full Text: PDF
GTID: 1488306569984269    Subject: Computer Science and Technology
Abstract/Summary:
Sounds contain a great deal of information about our environments. With the huge amount of audio data available on the Internet and the growing reliance on smart devices in daily life, there is an urgent need for machines that can better perceive and understand sound. Research on sound perception and understanding currently focuses on acoustic event recognition and acoustic scene classification. Both are audio classification tasks, and a key problem in both is how to learn effective semantic representations of audio samples. Audio is by nature a temporal signal: its semantic content depends not only on the content of its elements but also on the temporal relations between those elements. However, without explicitly modeling the temporal dependencies between frame-level or segment-level elements, neither traditional frame-level acoustic feature extraction methods nor current deep learning-based segment-level time-frequency representation learning methods can effectively characterize the complete temporal information in an audio sample. Lacking complete temporal information, these methods cannot obtain audio representations with complete semantic content for each sample. Although several recent studies have attempted to incorporate the long-term dependencies between elements into audio semantic feature learning by means of time-sequence modeling, these methods are limited in how effectively they learn temporal information and do not fully account for the complexity and diversity of audio signals. A more in-depth study of temporal relation-dependent audio semantic representation learning methods is therefore still needed. To this end, this thesis studies the learning of temporal information in audio samples from both the unsupervised and the supervised perspective, and proposes a series of solutions.

First, to explore a semantic representation learning method that can effectively describe complete temporal information, a time-sequence modeling method that captures the temporal relations between elements at a single scale is proposed, starting from the simplest case. The segment-level element representation sequence of each audio sample is first constructed at a fixed time scale using the bag-of-audio-words method, and a regression-based unsupervised temporal feature learning method is then proposed. Constrained by the chronological order of the elements, the method encodes the temporal relations with a linear function whose learnable parameters are obtained by solving a support vector regression problem; the learned parameters are then taken as the representation of the complete input sequence. In addition, when the element representation sequence produced by sparse coding-based bag-of-audio-words is used as input, the method generates more robust audio semantic representations. Experimental results show that the proposed unsupervised feature learning method effectively improves the performance of audio classification systems.
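The following is a minimal sketch of the regression-based temporal encoding idea summarized above: a linear function is fit, under the chronological-order constraint, to map each element representation to its position in the sequence, and the learned regression parameters become the fixed-length representation of the whole sequence. The use of scikit-learn's LinearSVR, the time-index regression target, and the feature dimensions are illustrative assumptions, not the thesis's exact formulation.

# Sketch: encode a chronologically ordered element sequence as the
# parameters of a linear support vector regression fit to time order.
import numpy as np
from sklearn.svm import LinearSVR

def temporal_encode(elements: np.ndarray, C: float = 1.0) -> np.ndarray:
    """elements: (T, D) segment-level representations in time order.
    Returns a (D,) vector: the weights of a linear function fit to
    map each element to its position in the chronological order."""
    T = elements.shape[0]
    targets = np.arange(1, T + 1, dtype=float)  # chronological positions
    svr = LinearSVR(C=C, fit_intercept=True, max_iter=10_000)
    svr.fit(elements, targets)
    return svr.coef_  # learned parameters = sequence representation

# Usage (hypothetical data): encode a sequence of bag-of-audio-words
# histograms for one audio sample into a single semantic vector.
rng = np.random.default_rng(0)
boaw_sequence = rng.random((20, 128))        # 20 segments, 128-word codebook
audio_repr = temporal_encode(boaw_sequence)  # (128,) representation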
Second, in most existing methods, the temporal relations are obtained by time-sequence modeling over a representation sequence at a single uniform scale. However, the sound elements in audio data may vary dramatically within a short time or remain stable over a relatively long one, so their time scales should not all be the same; a uniform-scale representation cannot fully reflect this characteristic of audio data, and a multi-scale representation is needed. The previously proposed representation learning method is therefore extended to a hierarchically structured network, called the pyramidal temporal pooling network, which aims to capture the temporal relations between multi-scale elements. First, a convolutional neural network that characterizes the local time-frequency structure at multiple scales is used to construct the element representation sequence of each sample; the pyramidal temporal pooling network is then used to obtain more expressive audio representations. Experimental results show that the proposed method effectively improves the performance of acoustic event recognition and acoustic scene classification.

Third, since the category labels of audio samples encode prior knowledge from human cognition, exploiting this prior knowledge helps to obtain more effective semantic representations. To this end, building on the basic idea of bi-level optimization, a supervised temporal feature learning method is proposed that introduces category prior information into the learning of temporal relation-dependent semantic representations. A bi-level optimization structure for the task-driven temporal semantic representation learning problem is first constructed, in which the temporal encoding problem of the earlier regression-based unsupervised method serves as the lower-level constraint of the upper-level classifier's optimization objective. A gradient-based optimization strategy is then used to solve the bi-level problem, so that the temporal relation-dependent audio representations and the classifier parameters are learned jointly. Experimental results show that the proposed method obtains more discriminative semantic audio representations in a lower-dimensional feature space.

Finally, acoustic scenes contain semantically unrelated short-term sound patterns that share no temporal dependencies, and capturing the temporal relations of all patterns inevitably introduces redundant information; the model should therefore focus on the temporal relations of the semantically related patterns. To this end, the idea of learning the temporal relations between elements under a semantic nearest-neighbor constraint is proposed, and an end-to-end 3-dimensional convolutional neural network is designed to jointly learn the element representations, the semantic representation based on the temporal relations between elements, and the classifier. The network first maps the representation of each element, obtained by stacked convolution operations, into a semantic space, and finds the semantic neighbors of each element by clustering. It then uses a multi-layer perceptron to learn the temporal relations between each element and its nearest neighbors. In addition, an attention-based pooling method is proposed to aggregate the temporal relations within each semantic neighborhood, so that the network can adaptively learn temporal relations in a larger neighborhood when this benefits classification. Experimental results show that the proposed network achieves good performance on acoustic scene classification and outperforms most mainstream deep learning methods.
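As an illustration of the multi-scale idea behind the pyramidal temporal pooling network described in the second contribution, the following sketch pools an element sequence at several temporal scales and concatenates the results. The three-level pyramid and the mean pooling inside each sub-segment are simplifying assumptions; the thesis applies its learned temporal encoding at each level rather than a fixed pooling function.

# Sketch: pool an element sequence at multiple temporal scales and
# concatenate the results into one multi-scale representation.
import numpy as np

def pyramidal_temporal_pool(elements: np.ndarray, levels: int = 3) -> np.ndarray:
    """elements: (T, D) element representations in time order.
    Level l splits the sequence into 2**l equal sub-segments and pools
    each one; pooled vectors from all levels are concatenated, so both
    coarse and fine temporal scales contribute to the representation."""
    pooled = []
    for level in range(levels):
        for chunk in np.array_split(elements, 2 ** level, axis=0):
            pooled.append(chunk.mean(axis=0))
    return np.concatenate(pooled)  # (D * (2**levels - 1),)

# Usage (hypothetical data): a 60-frame sequence of 128-d CNN features
# becomes a single (128 * 7,) multi-scale vector for the classifier.
rng = np.random.default_rng(0)
cnn_features = rng.random((60, 128))
multiscale_repr = pyramidal_temporal_pool(cnn_features)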
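Likewise, the following is a minimal PyTorch sketch of the semantic nearest-neighbor relation learning and attention-based pooling in the final contribution. Here a k-nearest-neighbor search stands in for the thesis's clustering step, and the layer sizes and the simple linear attention scoring are illustrative assumptions rather than the network's actual design.

# Sketch: learn temporal relations between each element and its
# semantic nearest neighbors, then aggregate them with attention.
import torch
import torch.nn as nn

class NeighborRelationPooling(nn.Module):
    def __init__(self, dim: int = 128, k: int = 4):
        super().__init__()
        self.k = k
        self.project = nn.Linear(dim, dim)             # map into semantic space
        self.relation = nn.Sequential(                 # relation MLP over pairs
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.attn = nn.Linear(dim, 1)                  # attention scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (T, D) element representations in time order."""
        z = self.project(x)                            # (T, D) semantic space
        dists = torch.cdist(z, z)                      # (T, T) pairwise distances
        # Each element relates only to its k semantically nearest
        # neighbors (index 0 is the element itself, so it is dropped).
        idx = dists.topk(self.k + 1, largest=False).indices[:, 1:]  # (T, k)
        neighbors = z[idx]                             # (T, k, D)
        anchors = z.unsqueeze(1).expand_as(neighbors)  # (T, k, D)
        rel = self.relation(torch.cat([anchors, neighbors], dim=-1))  # (T, k, D)
        w = torch.softmax(self.attn(rel), dim=1)       # weight each neighbor
        return (w * rel).sum(dim=1).mean(dim=0)        # (D,) pooled representation

# Usage (hypothetical data): pool 60 elements of one scene recording.
pooled = NeighborRelationPooling()(torch.randn(60, 128))  # (128,)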
Keywords/Search Tags:Acoustic event recognition, acoustic scene classification, audio semantic representation learning, temporal relation, joint learning, convolutional neural network