
Weakly Supervised Learning For Audio Analysis

Posted on: 2021-01-13  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Heinrich Din
GTID: 1488306506450084  Subject: Computer Science
Abstract/Summary:
Audio analysis is a crucial step toward endowing machines with human-like hearing and perception. Recent advances in machine learning, specifically deep learning, have achieved near-human performance on tasks such as automatic speech recognition and speaker recognition. Although deep learning scales well with large datasets, current deep learning technologies are limited by their reliance on big data and strong, human-annotated labels. Most current deep learning approaches are therefore costly, since they require collecting vast amounts of data or human annotations. This reliance on human-annotated labels is particularly severe in audio analysis, where no common labeling convention has been established. Standard labeling methods exist only for well-designed tasks such as automatic speech recognition (ASR) and speaker verification (SV): ASR requires only text transcriptions of spoken content, without precise timestamps, while large SV datasets such as VoxCeleb rely on visual face information to identify speakers. Within audio analysis, a cheaper, yet less well-researched, way to counteract the strong labeling requirement is the use of weak labels. Weak labeling has many facets, such as inexact labeling (e.g., labels given only at clip level, or provided as a "bag") and incorrect or incomplete labeling (e.g., wrongly annotated or missing labels). Because of these problems, weakly labeled machine learning approaches generally perform worse than strongly supervised methods. This thesis investigates the limits of weak labeling, in particular the problems of training neural network models under inexact labeling and in scarce-data scenarios. Its main contributions address three aspects: the absence of sufficient data, the localization capability, and the noise robustness of neural networks trained with weak labels.

First, training on scarce data for weakly labeled audio analysis is investigated. A novel self-supervised training method is proposed to circumvent data scarcity. The method is akin to the Word2Vec approach common in natural language processing and is therefore named Audio2Vec. Audio2Vec is applied to two weakly labeled domains: depression detection and audio tagging. For depression detection, the embedding becomes DEPA (depression audio embedding); for audio tagging, it becomes SATE (self-supervised audio tagging embedding). Experiments on both tasks show that DEPA successfully improves depression detection performance, outperforming previous methods based on traditional front-end features as well as advanced methods such as x-vectors. The limits of the Audio2Vec approach are further investigated via the SATE embedding.

Second, localizing sound events with only weak labels during training is difficult due to the lack of duration supervision. This thesis analyzes the current problems and limits of duration robustness in weakly supervised sound event detection. The analysis reveals that previous methods lack flexibility across datasets and require strongly supervised data to function; moreover, most previous methods perform well only because of dataset-dependent post-processing. This thesis introduces a novel, dataset-agnostic post-processing method that does not require strong labels. Another contribution is a duration-robust CRNN framework (CDur). Compared against previous approaches, CDur shows significant performance gains in terms of tagging- and event-F1 scores, and is the only method capable of working well across a plethora of datasets. When post-processing is disabled, a significant performance gap between CDur and previous state-of-the-art methods is revealed. Lastly, CDur outperforms a strongly supervised approach on the URBAN-SED dataset. For the first time in the field of weakly supervised sound event detection, a weakly supervised approach outperforms a strongly supervised one.

Third, the achievements of CDur are applied directly to voice activity detection (VAD). Previous supervised VAD models require initial frame-level labels obtained via a hidden Markov model (HMM), and therefore rely on clean, strongly supervised data for training. This thesis proposes weakly labeled VAD, trained directly from clip-level labels, avoiding the HMM-derived strong labels. The approach is named general-purpose VAD (GPVAD), since it enables training on any unconstrained, real-world dataset; CDur is used as the architecture of choice. Results show the performance differences between traditional supervised VAD and GPVAD: traditional approaches, owing to their frame-level supervision, outperform GPVAD in clean scenarios, but the more noise is present during evaluation, the better GPVAD performs in comparison. Ultimately, GPVAD outperforms traditional VAD methods in real-world scenarios. Further, data-driven GPVAD is proposed: a teacher model is trained on weakly labeled data and then passes strongly labeled, soft predictions to a student. This method can be applied to virtually any dataset, enabling the training of language-agnostic and noise-robust VAD models. Leveraging extensive unsupervised datasets for data-driven GPVAD is also investigated; results show that the proposed data-driven framework greatly enhances performance compared to traditional GPVAD training, with a significant performance gap to previous state-of-the-art models. Lastly, the data-driven GPVAD framework is compared against industry-grade datasets and approaches in terms of downstream automatic speech recognition (ASR) performance. By adapting the framework to a target language (Chinese), performance on par with traditional supervised large-data VAD models is achieved; in fact, in three out of four proposed tests, the weakly supervised GPVAD approach outperforms traditional VAD. For the first time, it has been shown that VAD models can be trained using weak labels without sacrificing performance.
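The abstract describes Audio2Vec only at a high level, as a Word2Vec-like pretext task. As an illustration, a CBOW-style pair construction on spectrogram frames could look like the sketch below: the model would learn to predict a center frame from its surrounding context. The function name, the mean-context choice, and the window size are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

def make_cbow_pairs(spec, window=2):
    """Build (context, target) pairs from a (frames, mels) spectrogram:
    the target is the center frame, the context is the mean of the
    surrounding +/- `window` frames (a CBOW-style pretext task)."""
    contexts, targets = [], []
    for t in range(window, len(spec) - window):
        left = spec[t - window:t]
        right = spec[t + 1:t + 1 + window]
        contexts.append(np.concatenate([left, right]).mean(axis=0))
        targets.append(spec[t])
    return np.stack(contexts), np.stack(targets)

# Toy usage: 100 frames x 64 mel bins of random "audio" features.
rng = np.random.default_rng(0)
spec = rng.standard_normal((100, 64))
X, y = make_cbow_pairs(spec, window=2)
print(X.shape, y.shape)  # (96, 64) (96, 64)
```

An encoder-decoder trained on such pairs yields an embedding that can then be transferred to downstream weakly labeled tasks such as depression detection (DEPA) or audio tagging (SATE).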
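The second contribution combines clip-level aggregation of frame probabilities with a dataset-agnostic post-processing step. A minimal sketch of both ideas, assuming linear-softmax pooling and a double-threshold scheme (common choices in weakly supervised sound event detection; the threshold values here are illustrative):

```python
import numpy as np

def linear_softmax_pool(probs):
    """Aggregate frame-level probabilities (T,) into one clip-level
    probability via sum(p^2)/sum(p): confident frames are weighted
    more than with mean-pooling, yet it is less brittle than max."""
    return float((probs ** 2).sum() / (probs.sum() + 1e-8))

def double_threshold(probs, hi=0.75, lo=0.2):
    """Dataset-agnostic post-processing: seed an event wherever
    p > hi, then grow each seed outward while p > lo.
    Returns a binary frame-level mask (no strong labels needed)."""
    mask = np.zeros_like(probs, dtype=bool)
    above_lo = probs > lo
    for seed in np.flatnonzero(probs > hi):
        l = seed
        while l > 0 and above_lo[l - 1]:
            l -= 1
        r = seed
        while r < len(probs) - 1 and above_lo[r + 1]:
            r += 1
        mask[l:r + 1] = True
    return mask

# Toy frame posteriors for one event class.
probs = np.array([0.1, 0.3, 0.9, 0.4, 0.1])
clip_prob = linear_softmax_pool(probs)   # ~0.6, between mean and max
onsets = double_threshold(probs)         # frames 1..3 marked active
```

Unlike tuned median-filter lengths, the two thresholds need no per-dataset calibration from strong labels, which is what makes the post-processing dataset-agnostic.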
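The teacher-student idea behind data-driven GPVAD can be sketched as matching a student's frame-level posteriors to a teacher's soft predictions with a binary cross-entropy loss. The arrays and the plain BCE formulation below are illustrative assumptions, not the thesis code:

```python
import numpy as np

def soft_bce(pred, target, eps=1e-7):
    """Frame-level binary cross-entropy against soft (non-binary)
    targets, as used when distilling a teacher into a student."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(pred)
                   + (1 - target) * np.log(1 - pred)).mean())

# Teacher: trained only on weak (clip-level) labels, then run over an
# arbitrary unlabeled dataset to emit frame-level soft speech posteriors.
teacher_post = np.array([0.05, 0.10, 0.92, 0.97, 0.88, 0.20])

# Student: trained to match the teacher frame by frame, inheriting
# frame-level supervision without any manual strong labels.
student_pred = np.array([0.10, 0.15, 0.85, 0.90, 0.80, 0.30])

loss = soft_bce(student_pred, teacher_post)
```

Because the teacher's posteriors can be produced for any audio, the student can be trained on unconstrained, real-world data, which is what makes the resulting VAD language-agnostic and noise-robust.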
Keywords/Search Tags:Weakly supervised learning, deep neural networks, audio analysis, sound event detection, voice activity detection, depression detection, audio tagging