Design And Implementation Of Audio Caption Algorithm Based On Zynq Platform

Posted on:2022-12-18

Degree:Master

Type:Thesis

Country:China

Candidate:Z J Meng

Full Text:PDF

GTID:2518306764971249

Subject:Telecom Technology

Abstract/Summary:

Audio caption (AC) is a highly abstract cross-modal translation task: describing complex acoustic scenes in highly condensed natural language. Audio caption can model the physical properties of concepts, objects, and environments, as well as high-level knowledge, and it is used in many areas, such as automatic content description for intelligent, content-oriented machine-to-machine interaction. Current audio caption systems are typically based on the encoder-decoder model, a neural-network-based approach that performs well on audio caption tasks but is difficult to deploy on resource-constrained hardware platforms because of its large number of parameters and heavy computational load. The thesis addresses these issues, and the main work is as follows.

1) The state of the art in audio caption, both in China and abroad, was surveyed, and the CNN-Transformer, which currently performs well in this field, was selected as the encoder-decoder network.

2) To meet the requirements of the selected model and its subsequent optimization, the public Clotho dataset was pre-processed to construct a multi-label audio dataset containing 300 words and a search dictionary containing 4368 words.

3) To reduce the number of model parameters, the selected audio caption algorithm was improved based on VGGish. The final model was determined by experimenting with different encoder structures and was then optimized using pre-training and fine-tuning. Finally, the optimized algorithm was evaluated in fixed-point simulation experiments with different bit widths.

4) The MZ 7035 FA Xilinx Zynq was selected as the implementation platform, and an audio caption system consisting of audio acquisition, feature extraction, and audio caption modules was designed. The audio acquisition part reduces the audio data to be stored through a frame storage strategy. The feature extraction part stores only the valid parameters, which reduces the number of filter coefficients to be stored for Mel filtering. In the audio caption module, the encoder uses a configurable, reusable convolution module to improve storage resource utilization, and the decoder uses a reusable PE group computation engine for matrix multiplication to improve computational resource utilization.

Compared with the original algorithm, the thesis achieves a 58.2% decrease in the number of audio data parameters required and an 8.9% decrease in the number of model parameters, with only a 2.6% loss in performance. The design and implementation of the audio caption algorithm were also optimized for the hardware platform, reducing both resource consumption and the amount of computation. In the end, the system consumed 60.4% of BRAM resources and 63.6% of DSP resources, with a power consumption of 3.192 W. In actual tests, the audio caption system achieved a SPIDEr score of 0.202.
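
The idea of storing only the valid parameters for Mel filtering can be illustrated in software. Each triangular Mel filter is non-zero over only a short run of FFT bins, so it is enough to keep the index of the first non-zero weight together with that run. The sketch below is an illustration under assumptions, not the thesis's hardware design: the sample rate (44100 Hz), FFT size (1024), number of Mel bands (64), the HTK-style Mel formula, and all function names are hypothetical choices made for this example.

```python
import math


def hz_to_mel(f):
    # HTK-style Mel scale (an assumption; the thesis does not state its formula).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def dense_mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels rows, each holding n_fft // 2 + 1 triangular weights."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def pack_filterbank(bank):
    """Keep, per filter, only (first non-zero bin, run of non-zero weights)."""
    packed = []
    for row in bank:
        nonzero = [k for k, w in enumerate(row) if w > 0.0]
        if not nonzero:
            packed.append((0, []))
            continue
        start, stop = nonzero[0], nonzero[-1] + 1
        packed.append((start, row[start:stop]))
    return packed


def apply_dense(bank, power):
    # Reference: multiply every bin by its weight, zeros included.
    return [sum(w * p for w, p in zip(row, power)) for row in bank]


def apply_packed(packed, power):
    # Same result, but only the stored non-zero weights are touched.
    return [
        sum(w * power[start + i] for i, w in enumerate(weights))
        for start, weights in packed
    ]


if __name__ == "__main__":
    bank = dense_mel_filterbank(n_mels=64, n_fft=1024, sample_rate=44100)
    packed = pack_filterbank(bank)
    dense_count = sum(len(row) for row in bank)
    stored_count = sum(len(weights) for _, weights in packed)
    print("dense coefficients :", dense_count)
    print("stored coefficients:", stored_count)
```

Because neighbouring triangular filters overlap only with their immediate neighbours, the number of stored weights grows roughly with the number of FFT bins rather than with bins times bands, which is why this kind of storage scheme suits on-chip memory.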

Keywords/Search Tags:

Audio Caption, Log-Mel, CNN, SoC, Transformer

Related items

1	Research On Image Caption Model Of Multi-target Language
2	Research On Image Caption Method Based On Deep Learning
3	Research On Image Caption Generation Model Based On Attention Mechanism
4	Study Of Video Caption Extraction Algorithm Based On Spatial-Temporal Information
5	Image Caption Model Based On Deep Reinforcement Learning
6	Research On Key Technologies Of Image Caption Based On Multimodal Feature Understanding
7	Research On Image Caption Generation Method Based On Deep Learning
8	Image Description Method Based On Deep Learning
9	Analysis, Based On The Detection And Extraction Of The News Video Subtitles
10	Image Caption Method Based On Deep Learning