
Theories And Applications Of Multi-modal Learning

Posted on: 2020-11-27
Degree: Master
Type: Thesis
Country: China
Candidate: D C Yu
GTID: 2428330623963708
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rapid development of artificial intelligence, we increasingly require machines to possess high-level abilities of understanding and reasoning about what they perceive. Data usually exists in multiple modalities that differ in statistical characteristics, yet these modalities are closely related and complementary to each other most of the time. This makes research on multi-modal learning highly significant. Multi-modal learning is closely tied to specific tasks and covers a wide variety of theoretical approaches and applications.

This thesis first introduces the basic methods of multi-modal learning, including representation, translation, alignment, fusion, and co-learning, and summarizes these theoretical methods together with several current research hotspots. It then discusses feature selection and model construction for multi-modal learning in two specific tasks: visual question answering and 3D object localization in assisted surgery, describing the representation schemes and model structures in detail and explaining the design choices. The goal of this thesis is to study specific theoretical methods and applications of multi-modal learning in order to improve models' representation abilities and achieve better results.

In the visual question answering (VQA) task, we propose a VQA model based on structured semantic representation. We illustrate the compositionality of general cognitive abilities in VQA and take the linguistic structure of language into account in the semantic representation. We decompose the question into several components using a semantic tree and apply a tree-structured model to distill the sentence representation. In addition, we exploit the complementary images in the new VQA 2.0 dataset and optimize the classifier used to predict answers: we design a dual-path network so that, during training, the model can effectively take advantage of this property of the dataset.

Assisted surgery is an important part of biomedicine. In the 3D localization task for surgical instruments, we propose to use the optical flow field to supplement motion information, which improves the localization accuracy of the target and the robustness of the algorithm against occlusion and jitter; this is an application of a multi-modal learning method. In addition, a neural network is proposed to predict matching 3D coordinate points in order to refine the pose prediction. This multi-modal end-to-end framework achieves better localization results and lends itself to practical applications.
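The tree-structured question encoding described above can be pictured with a minimal sketch. The code below is illustrative only: it assumes a question already parsed into a binary semantic tree, and the names `TreeNode` and `TreeEncoder`, the composition function, and all dimensions are hypothetical rather than taken from the thesis.

```python
# Minimal sketch of a tree-structured question encoder for VQA.
# Assumes the question has already been parsed into a binary semantic tree.
import torch
import torch.nn as nn

class TreeNode:
    def __init__(self, word_id=None, left=None, right=None):
        self.word_id = word_id      # set only for leaf nodes
        self.left, self.right = left, right

class TreeEncoder(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.compose = nn.Linear(2 * dim, dim)   # merges two child states

    def forward(self, node):
        if node.word_id is not None:             # leaf: embed the word
            return self.embed(torch.tensor([node.word_id]))
        left = self.forward(node.left)           # recurse over children
        right = self.forward(node.right)
        return torch.tanh(self.compose(torch.cat([left, right], dim=-1)))

# Toy usage: compose ("what" "color") with ("the" "car") into one question vector.
tree = TreeNode(left=TreeNode(left=TreeNode(word_id=0), right=TreeNode(word_id=1)),
                right=TreeNode(left=TreeNode(word_id=2), right=TreeNode(word_id=3)))
q_vec = TreeEncoder(vocab_size=10, dim=8)(tree)
print(q_vec.shape)   # torch.Size([1, 8])
```

The point of the sketch is only that the sentence representation is built bottom-up along the semantic tree, rather than left-to-right as in a flat sequence model.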
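Likewise, the idea of supplementing the appearance stream with an optical flow field can be sketched as follows. OpenCV's Farneback estimator is used here purely as a stand-in for whichever flow method the thesis actually adopts; the function `rgb_plus_flow` and the simple channel-stacking fusion are illustrative assumptions, not the thesis's implementation.

```python
# Hedged sketch: dense optical flow as an extra motion modality
# stacked with the RGB frame for surgical-instrument localization.
import cv2
import numpy as np

def rgb_plus_flow(prev_bgr, curr_bgr):
    """Stack the current frame with a 2-channel flow field -> (H, W, 5)."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2)
    rgb = curr_bgr.astype(np.float32) / 255.0
    return np.concatenate([rgb, flow], axis=-1)   # fused multi-modal input

# Toy usage with random frames.
prev = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
curr = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
fused = rgb_plus_flow(prev, curr)
print(fused.shape)   # (64, 64, 5)
```

The flow channels carry the motion cues that a single frame lacks, which is what makes the localization more robust to occlusion and jitter.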
Keywords/Search Tags:Multi-modal learning, Visual Question Answering, Optical flow, 3D Object localization and tracking