
Design And Implementation Of Audio Caption Algorithm Based On Zynq Platform

Posted on: 2022-12-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Meng
Full Text: PDF
GTID: 2518306764971249
Subject: Telecom Technology
Abstract/Summary:
Audio captioning (AC) is a highly abstract cross-modal translation task: describing complex acoustic scenes in highly condensed natural language. Audio captioning can model the physical properties of concepts, objects, and environments, as well as high-level knowledge, and is used in many areas, such as automatic content description for intelligent, content-oriented machine-to-machine interaction. Current audio caption systems are typically based on the encoder-decoder model, a neural-network-based approach that performs well on audio caption tasks but is difficult to deploy on resource-constrained hardware platforms because of its large parameter count and computational cost. The thesis addresses these issues, and the main work is as follows.
1) The current state of the art in audio captioning, both in China and abroad, is surveyed, and the CNN-Transformer, which currently performs well in this field, is selected as the encoder-decoder network.
2) According to the requirements of the selected model and its subsequent optimization, the public dataset Clotho is pre-processed to construct a multi-label audio dataset containing 300 words and a search dictionary containing 4368 words.
3) To reduce the number of model parameters, the selected audio caption algorithm is improved based on VGGish. The final model is determined by experimenting with different encoder structures, the model is optimized using pre-training and fine-tuning, and the optimized algorithm is then subjected to fixed-point simulation experiments with different bit widths.
4) The thesis selects the MZ 7035 FA Xilinx Zynq as the implementation platform and designs an audio caption system consisting of audio acquisition, feature extraction, and audio caption modules. The audio acquisition part reduces the audio data to be stored through a frame storage strategy; the feature extraction part stores only the valid parameters, which reduces the number of filter coefficients to be stored for Mel filtering; the encoder part of the audio caption module uses a configurable and reusable convolution module to improve the utilization of storage resources; and the decoder part uses a reusable PE group computation engine for matrix multiplication to improve the utilization of computational resources.
Compared with the original algorithm, the thesis reports a 58.2% decrease in the number of audio data parameters required and an 8.9% decrease in the number of model parameters, with only a 2.6% loss in performance. The design and implementation of the audio caption algorithm were also optimized for the hardware platform, reducing both resource consumption and the amount of computation. In the end, the system consumed 60.4% of BRAM resources and 63.6% of DSP resources, with a power consumption of 3.192 W. In actual tests, the audio caption system achieved a SPIDEr score of 0.202.
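
The idea of storing only the valid parameters for Mel filtering can be illustrated with a minimal sketch. Each triangular Mel filter is nonzero over only a short run of FFT bins, so it is enough to keep a start index and the nonzero weights. The sample rate (44100 Hz), FFT size (1024), number of Mel bands (64), and all function names below are assumptions made for illustration; the abstract does not give the thesis's actual parameters or its hardware storage layout.

```python
import math


def hz_to_mel(hz):
    # Common HTK-style Mel formula (an assumption; the thesis may use another).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def dense_mel_filterbank(sample_rate, n_fft, n_mels):
    """Return n_mels rows of n_fft // 2 + 1 triangular weights (mostly zeros)."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def pack_filterbank(bank):
    """Keep only each filter's nonzero run as (start_bin, weights)."""
    packed = []
    for row in bank:
        nonzero = [k for k, w in enumerate(row) if w > 0.0]
        if nonzero:
            first, last = nonzero[0], nonzero[-1]
            packed.append((first, row[first:last + 1]))
        else:
            packed.append((0, []))
    return packed


def stored_values(packed):
    """Count stored numbers: one start index plus the weights per filter."""
    return sum(1 + len(weights) for _, weights in packed)


def apply_dense(bank, spectrum):
    return [sum(w * s for w, s in zip(row, spectrum)) for row in bank]


def apply_packed(packed, spectrum):
    return [
        sum(w * spectrum[start + i] for i, w in enumerate(weights))
        for start, weights in packed
    ]


if __name__ == "__main__":
    bank = dense_mel_filterbank(44100, 1024, 64)
    packed = pack_filterbank(bank)
    print("dense coefficients :", len(bank) * len(bank[0]))
    print("packed values      :", stored_values(packed))
```

Both `apply_dense` and `apply_packed` give the same Mel energies for a given power spectrum; the packed form simply skips the multiplications by zero, which is why it needs far less coefficient storage.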
Keywords/Search Tags: Audio Caption, Log-Mel, CNN, SoC, Transformer