
Research On Efficient Global Modeling For Audiovisual Representation Learning

Posted on: 2024-07-09  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y C Zhao  Full Text: PDF
GTID: 1528306929491554  Subject: Control Science and Engineering
Abstract/Summary:
Deep representation learning is a research field that uses deep neural networks to learn data representations. With its ability to achieve high accuracy and strong generalization in multimedia domains such as images, videos, and speech, it has become one of the most popular and widely used technologies in artificial intelligence today. Deep representation learning is a key method for improving the performance of audiovisual intelligent applications. Its advantage lies in using data-driven, learnable nonlinear transformations to connect input data with high-level representations. The choice of transformation is critical to a model's success, making it a central focus of representation learning research. Over the past few years, transformations with global modeling properties have gained significant attention, leading to a family of methods known as global modeling methods. Their key feature is a global modeling network structure that jointly connects all features in a data sample, enabling them to capture diverse correlation patterns, including global correlations. This improves the representation and understanding ability of neural networks, particularly for complex patterns in large-scale data, making global modeling a key technology in the development of artificial intelligence algorithms. However, while global modeling methods offer significant advantages in representation learning ability, their practical application is hindered by low computational efficiency. For instance, the computational complexity of a standard global modeling method such as the Transformer grows quadratically with the input size. Developing efficient global modeling mechanisms is therefore essential for broader success.

This thesis presents an in-depth and comprehensive study on improving the computational efficiency of global modeling methods and on addressing the application problems of current, inefficient global modeling methods. The research content and main contributions of this thesis can be summarized in the following five aspects, which cover different data characteristics and application scenarios:

(1) This thesis proposes a multi-scale group Transformer for the global modeling problem on one-dimensional speech data. The proposed method effectively models long one-dimensional sequences, leading to improved speech understanding. It combines a group self-attention mechanism with a multi-scale fusion mechanism: self-attention is computed within restricted attention regions at multiple feature scales, and multi-scale fusion then achieves global modeling on speech data.

(2) This thesis introduces a sparse multi-layer perceptron (MLP) network for the global modeling problem on two-dimensional image data. The proposed method efficiently models images globally using the MLP as its primary building block, leading to improved image understanding. It represents a powerful exploration of MLP-like global modeling network structures, which offer simplicity and high computational efficiency compared with the Transformer. The sparse multi-layer perceptron network incorporates several innovative mechanisms centered on a sparsification design, addressing the difficulty of effectively training MLP-like methods and further enhancing the model's computational efficiency.

(3) This thesis further analyzes global modeling methods for image data, examining their scope of application in image representation learning. In this study, we design an experimental framework named the spatial-channel separate network to fairly compare global modeling methods with local modeling methods. Through experimental analysis, we draw conclusions that are useful for current applications and instructive for future research on global modeling methods.

(4) This thesis proposes a triple spatiotemporal decomposition method for the global modeling problem of three-dimensional videos with fixed temporal length. The method achieves efficient global modeling of three-dimensional video data, enhancing the model's ability to understand videos. To leverage the decoupled spatiotemporal patterns in videos, we introduce an attention-region partition mechanism on top of an existing video Transformer. The mechanism computes attention on three pre-defined planes, significantly reducing the computational cost of the video Transformer while also improving the model's efficiency in utilizing video data.

(5) This thesis also proposes a streaming video model for the global modeling problem on unbounded streaming video data. The proposed method improves the computational efficiency of global modeling on streaming video data and expands the versatility of video global modeling methods. We adopt a two-stage spatiotemporal global modeling design and introduce a temporal-aware spatial encoder, realizing a universal global modeling network structure. The streaming video model can solve both frame-based and sequence-based video tasks, achieving excellent performance on both.

Under the topic of global modeling for audiovisual representation learning, this thesis proposes several methods that achieve leading performance or efficiency advantages on a range of tasks, advancing the application of global modeling methods. We hope that our work can provide new directions and methods for the research community.
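The efficiency theme shared by these contributions, restricting attention to limited regions so that cost no longer grows quadratically with sequence length, can be illustrated with a minimal NumPy sketch of group self-attention as in (1). The function name, the single-head formulation, and the omission of learned Q/K/V projections and multi-scale fusion are simplifications for illustration, not the thesis's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def group_self_attention(x, group_size):
    """Self-attention computed independently within fixed-size groups.

    Restricting attention to groups of g frames lowers the cost from
    O(T^2 * d) to O(T * g * d) for a sequence of length T; setting
    group_size = T recovers standard full self-attention.
    x: (T, d) feature sequence; T must be divisible by group_size.
    """
    T, d = x.shape
    assert T % group_size == 0, "sequence length must divide into groups"
    out = np.empty_like(x)
    for start in range(0, T, group_size):
        # Queries, keys, and values share the raw features here;
        # learned projections are omitted for brevity.
        q = k = v = x[start:start + group_size]
        scores = q @ k.T / np.sqrt(d)        # (g, g) instead of (T, T)
        out[start:start + group_size] = softmax(scores) @ v
    return out

# Toy sequence: 8 frames of 4-dimensional features
x = np.arange(32, dtype=float).reshape(8, 4)
y = group_self_attention(x, group_size=4)
print(y.shape)  # (8, 4)
```

In the thesis's design, such restricted attention is applied at multiple feature scales and the results are fused to recover global context; the sketch above shows only the per-group step that yields the complexity reduction.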
Keywords/Search Tags: Representation learning, Deep learning, Global representation learning, Audiovisual, Transformer, Multi-Layer Perceptron