
Towards A Lightweight Framework For Multimodal Reasoning Learning

Posted on: 2024-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: Z T Jin
Full Text: PDF
GTID: 2568307103970039
Subject: Computer technology

Abstract/Summary:
Multimodal reasoning learning, a research hotspot in artificial intelligence in recent years, aims to process information from multiple modalities, such as images and text, within a single learning task. To achieve strong performance on multimodal tasks, researchers currently rely on deep learning models with large parameter counts and high computational requirements, which poses a significant challenge for model deployment. Lightweight methods for multimodal reasoning are therefore urgently needed.

Research on lightweight multimodal reasoning currently faces two main challenges. 1) At the single-task level, taking visual question answering (VQA) as an example: how can existing models adapt to different hardware devices at deployment time? Current VQA models usually have a fixed size, whereas the variety of real-world hardware means a single model cannot adjust to the computing resources of each device. 2) At the multi-task level: how can the excessive storage and computation costs of deploying existing multimodal pre-trained models be reduced? Traditional full fine-tuning leads to excessive storage costs at deployment, while existing adapter-tuning methods increase inference computation costs. Reducing both storage overhead and inference computation overhead is the key challenge in deploying pre-trained models. To address these two challenges, this thesis proposes the following two approaches.

To address the problem that VQA models cannot adapt to different hardware devices at deployment, this thesis proposes a general bilaterally slimmable Transformer framework for VQA. It designs an efficient bilateral slimming strategy for the Transformer architecture along the width and depth dimensions, so that each slimmed submodel maintains an optimal structure. The method also uses a triangular filtering strategy to remove redundant submodels before training; this strategy not only reduces training costs but also improves the performance of the remaining submodels. Finally, a knowledge distillation-based sampling training algorithm is proposed, which significantly improves the efficiency of model training.

To address the excessive storage and computational costs of deploying pre-trained models, this thesis proposes a prune-and-fill adapter framework, which prunes part of the weights of the pre-trained model and inserts lightweight task adapters into the pruned positions, so that the pre-trained model retains multimodal generic knowledge while remaining able to learn downstream tasks. Finally, a progressive guided distillation training algorithm is proposed to better bridge the gap between the pre-training task and the multimodal downstream tasks, guaranteeing the performance of the pre-trained model.

To verify the effectiveness of the proposed methods, extensive experiments are conducted on a wide range of multimodal datasets, and the results show that the proposed methods outperform existing methods on several metrics. In addition, a lightweight multimodal reasoning system is designed and implemented with the proposed methods at its core.
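As one plausible reading of the bilateral slimming and triangular filtering described above, the sketch below enumerates submodel configurations as (width ratio, depth ratio) pairs and discards shape-mismatched ones before training. The specific filtering rule (a `max_gap` threshold dropping very-wide-but-shallow and very-narrow-but-deep submodels) and all names here are illustrative assumptions, not the thesis's actual algorithm.

```python
from itertools import product

def enumerate_submodels(width_ratios, depth_ratios):
    """All (width, depth) ratio pairs of a bilaterally slimmable model,
    where 1.0 denotes the full width or full depth of the supernet."""
    return sorted(product(width_ratios, depth_ratios))

def triangular_filter(submodels, max_gap=0.5):
    """Assumed triangular rule: keep only submodels whose width and depth
    ratios are not badly mismatched, removing redundant extreme shapes
    (e.g. full width at minimum depth) before training."""
    return [(w, d) for w, d in submodels if abs(w - d) <= max_gap]

ratios = [0.25, 0.5, 0.75, 1.0]
candidates = enumerate_submodels(ratios, ratios)   # 16 pairs
kept = triangular_filter(candidates)               # e.g. drops (0.25, 1.0)
```

Only the kept submodels would then be sampled during the distillation-based training, which is where the claimed reduction in training cost would come from under this reading.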
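The prune-and-fill idea can be sketched in miniature on a flat weight list: prune some pre-trained weights, then place task-specific adapter parameters into the freed slots so the total parameter count, and hence inference compute, does not grow. Magnitude-based pruning and every function name below are assumptions for illustration; the thesis does not specify its pruning criterion here.

```python
def magnitude_prune(weights, ratio):
    """Assumed criterion: zero out the smallest-magnitude fraction
    `ratio` of the pre-trained weights."""
    n = int(len(weights) * ratio)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

def fill_with_adapter(pruned, adapter):
    """Fill the pruned (zeroed) slots with lightweight task-adapter
    parameters; only these few values need storing per downstream task."""
    out = list(pruned)
    slots = [i for i, w in enumerate(out) if w == 0.0]
    for i, a in zip(slots, adapter):
        out[i] = a
    return out

pruned = magnitude_prune([0.9, -0.1, 0.4, 0.05], ratio=0.5)
filled = fill_with_adapter(pruned, adapter=[0.2, 0.3])
```

Under this reading, the surviving pre-trained weights carry the generic multimodal knowledge, while per-task storage is limited to the small adapter vector, avoiding both full fine-tuning's storage cost and the extra inference compute of add-on adapters.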
Keywords/Search Tags:Multimodal Reasoning Learning, Visual Question Answering, Multimodal Pre-training, Lightweight Methods, Model Deployment