
Research On Automated Audio Captioning Based On Fine-Grained Semantic Information Perception

Posted on: 2024-07-11  Degree: Master  Type: Thesis
Country: China  Candidate: F Y Xiao  Full Text: PDF
GTID: 2568306941998189  Subject: Computer Science and Technology
Abstract/Summary:
Automated audio captioning is a cross-modal audio content understanding task that integrates audio signal processing and natural language processing. It aims to summarize the semantic content of an audio signal in natural language text, i.e., a caption. The task can facilitate human-machine interaction for people with hearing loss, sound analysis for security surveillance, and automatic content summarization for smart city construction. However, most existing methods ignore the local acoustic event information in the audio features when decoding them to generate the caption. Moreover, the feature extraction process in these methods struggles to model the global contextual semantic content of the audio, which limits captioning performance. To address these issues, this thesis studies automated audio captioning through the exploration of fine-grained semantic information. The main research contents are as follows:

Firstly, a local event information assisted automated audio captioning method is proposed to capture the local information within the audio feature. In this method, we propose a local information assisted attention-free Transformer (Local AFT) decoder to generate the caption from the audio feature. The Local AFT decoder contains two key modules: a future interference masking (FIM) module and a local information assisted captioning (LAC) module. The FIM module handles sequential modelling by masking future positions. The LAC module introduces a window function that focuses on a local region of the audio feature, so that it can capture the potential local event information within the audio feature and decode it into the caption (a sketch of this local, masked decoding step is given after this abstract). In summary, the proposed P-Local AFT method improves captioning performance through its local event representation.

Subsequently, an audio captioning method with graph modelling (Graph AC) is proposed, which captures the global contextual semantic information about the acoustic scene via a graph feature representation based audio encoder. In this encoder, a graph learning based feature representation module is designed: it clips the audio feature into audio feature frame nodes, builds an adjacency graph to learn the contextual associations between these nodes, and filters out meaningless node relations with a top-k mask strategy (sketched below). With the learnt adjacency graph, node aggregation in this module highlights the important semantic information about the acoustic scene. The Graph AC method thus further improves captioning performance.

Finally, an ensemble learning based method is proposed to construct a fused prediction scheme that exploits both the global contextual information about the acoustic scene and the local acoustic event information, so that the two are complementary and the fine-grained semantic information is explored and utilized further (a sketch of the fusion step is also given below). The proposed ensemble learning method was submitted to the automated audio captioning task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge, where it ranked third on the public test set and sixth on the internal test set. This demonstrates the effectiveness and superiority of the proposed automated audio captioning methods based on fine-grained semantic information perception.
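To make the Local AFT decoding step concrete, here is a minimal sketch of a local, causal attention-free mixing operation, written in PyTorch in the spirit of attention-free Transformers. The FIM-style causal mask blocks future positions and the LAC-style window restricts each step to nearby frames; all names, shapes, and the window size are illustrative assumptions rather than the thesis implementation.

import torch

def local_aft_step(q, k, v, window=7):
    # q, k, v: (T, D) query/key/value projections of a length-T sequence
    T, _ = k.shape
    pos = torch.arange(T)
    # FIM-style future masking: step t may only attend to t' <= t ...
    causal = pos.unsqueeze(1) >= pos.unsqueeze(0)            # (T, T)
    # ... and LAC-style locality: only the most recent `window` frames
    local = (pos.unsqueeze(1) - pos.unsqueeze(0)) < window   # (T, T)
    mask = (causal & local).unsqueeze(-1).float()            # (T, T, 1)
    # exp(K)-weighted average of V over the allowed (past, local) region
    w = torch.exp(k).unsqueeze(0) * mask                     # (T, T, D)
    num = (w * v.unsqueeze(0)).sum(dim=1)                    # (T, D)
    den = w.sum(dim=1).clamp_min(1e-9)
    return torch.sigmoid(q) * (num / den)                    # (T, D)

Enlarging the window trades local event focus for wider context, which is the knob the LAC module's window function effectively controls.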
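The graph learning based feature representation can be sketched just as minimally: audio feature frames become nodes, a similarity score builds the adjacency graph, a top-k mask drops weak relations, and aggregation mixes each node with its retained neighbours. Plain dot-product similarity stands in for the learned relation function here, purely as an assumption.

import torch
import torch.nn.functional as F

def graph_aggregate(frames, k=10):
    # frames: (N, D) audio feature frame nodes
    sim = frames @ frames.t()                       # (N, N) pairwise relation scores
    # top-k mask strategy: keep only the k strongest relations per node,
    # filtering out meaningless node relations
    idx = sim.topk(k=min(k, sim.size(-1)), dim=-1).indices
    mask = torch.full_like(sim, float('-inf'))
    mask.scatter_(-1, idx, 0.0)
    adj = F.softmax(sim + mask, dim=-1)             # normalized adjacency graph
    # node aggregation: each node gathers scene context from its neighbours
    return adj @ frames                             # (N, D) context-aware nodes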
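For the ensemble, one standard fusion rule, assumed here since the abstract does not state the exact one, is to average the per-model next-token probabilities at each decoding step, letting the Graph AC model's scene context and the Local AFT model's event focus complement each other.

import torch

def ensemble_step(logits_list):
    # logits_list: next-token logits from each member model, each (V,)
    probs = torch.stack([l.softmax(dim=-1) for l in logits_list]).mean(dim=0)
    return probs.argmax(dim=-1)   # greedy pick; beam search is also common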
Keywords/Search Tags: Automated audio captioning, Fine-grained semantic information perception, Local event information assistance, Graph learning based feature representation, Ensemble learning