
Research On DNN-based Cross-Modality Media Analysis

Posted on: 2022-06-24 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: S Wang | Full Text: PDF
GTID: 1488306560453644 | Subject: Signal and Information Processing
Abstract/Summary:
Cross-modality media analysis is a fundamental task in artificial intelligence. It aims to process data in different modalities, such as images and sentences, and to model the internal relationships between them. For different application scenarios, the task is usually divided into more specific subtasks, such as cross-modal retrieval, multimedia translation, and few-shot learning. Recently, with the development of computing technology, deep learning has become an indispensable component of cross-modality media analysis, where it is used to represent the content of multi-modal data or to transform media from one modality to another. This thesis analyzes the characteristics of data in different modalities and the related deep learning methods, and then explores the analysis and comprehension of multi-modal data from several directions. The main contributions are as follows:

(1) To address cross-modal retrieval, this thesis proposes a cross-modal learning model with joint correlative calculation learning. First, an auto-encoder embeds the visual features by minimizing the feature reconstruction error, and a multi-layer perceptron models the textual feature embedding. A joint loss function is then designed to optimize both the intra- and inter-correlations among image-sentence pairs, i.e., the reconstruction loss of the visual features, the similarity loss of paired samples, and the triplet relation loss between positive and negative examples. The joint loss is optimized over a batch score matrix, so that all mutually mismatched pairs in a batch contribute to training. Retrieval experiments demonstrate the effectiveness of the proposed method: it achieves performance comparable to the state of the art on three benchmarks, i.e., Flickr8k, Flickr30k, and MS-COCO.
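The joint loss in (1) can be made concrete with a minimal sketch. The module name, the margin value, and the way the embeddings are produced are illustrative assumptions rather than the thesis implementation; the sketch only shows how a batch score matrix yields the reconstruction, paired-similarity, and triplet terms that are summed into one objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCorrelativeLoss(nn.Module):
    """Sketch of a joint loss over a batch score matrix:
    reconstruction + paired-similarity + triplet terms (names are illustrative)."""

    def __init__(self, margin=0.2):
        super().__init__()
        self.margin = margin

    def forward(self, img_emb, txt_emb, img_feat, img_recon):
        # (a) reconstruction loss of the visual auto-encoder
        recon_loss = F.mse_loss(img_recon, img_feat)

        # batch score matrix: similarity of every image with every sentence
        img_n = F.normalize(img_emb, dim=1)
        txt_n = F.normalize(txt_emb, dim=1)
        scores = img_n @ txt_n.t()                    # (B, B), diagonal = matched pairs
        pos = scores.diag().unsqueeze(1)              # (B, 1) matched-pair scores

        # (b) similarity loss of the paired (matched) samples
        match_loss = (1.0 - pos).mean()

        # (c) triplet ranking loss using every mismatched pair in the batch
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_txt = (self.margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
        cost_img = (self.margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)
        triplet_loss = cost_txt.mean() + cost_img.mean()

        return recon_loss + match_loss + triplet_loss
```

Here `img_feat` would be the raw visual feature, `img_recon` the auto-encoder output, and `img_emb`/`txt_emb` the embedded visual and textual representations; these argument names are assumed for illustration.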
(2) For continuous sign language translation, this thesis proposes a hybrid deep architecture consisting of a temporal convolution module (TCOV), a bidirectional gated recurrent unit module (BGRU), and a fusion layer (FL). TCOV captures short-term temporal transitions over adjacent clip features (local pattern), while BGRU models long-term context transitions across the temporal dimension (global pattern). FL concatenates the feature embeddings of TCOV and BGRU to learn their complementary relationship (mutual pattern). A joint connectionist temporal fusion mechanism thus exploits the merits of each module, and a joint CTC loss together with a decoding fusion strategy based on deep classification scores further boosts performance. Trained only once under the CTC constraint, the model achieves performance comparable to existing methods that require multiple EM iterations. Experiments on the RWTH-PHOENIX-Weather benchmark demonstrate the effectiveness of the proposed method.

(3) To align sign language actions and automatically translate them into the corresponding words, this thesis proposes a dense temporal convolution network, termed Dense TCN, which captures the actions from hierarchical views. Within this network, a temporal convolution (TC) layer is designed to learn the short-term correlation among adjacent features, and the design is extended to a dense hierarchical structure in which the k-th TC layer integrates the outputs of all preceding layers (a sketch of this dense connectivity is given after the list of contributions): (a) a deeper TC layer has a larger receptive field and therefore captures long-term temporal context through the hierarchical content transition; (b) the integration addresses the translation problem from different views, combining embedded short-term and extended long-term sequential learning. Finally, the CTC loss and a fusion strategy are adopted to learn the feature-wise classification and to generate the translated sentence. Experimental results on two popular sign language benchmarks, i.e., PHOENIX and USTC-ConSents, demonstrate the effectiveness of the proposed method under various measurements.

(4) For few-shot learning, this thesis proposes a method based on multi-modal knowledge discovery. First, visual knowledge is used to help the feature extractors focus on different visual parts. Second, a classifier is designed to learn the distribution over all categories. In the second stage, three schemes are developed to minimize the prediction error and balance the training procedure: (a) hard labels provide precise supervision; (b) semantic textual knowledge serves as weak supervision to discover potential relations between the novel and the base categories; (c) an imbalance control derived from the data distribution alleviates the recognition bias towards the base categories. The method is applied to three benchmark datasets and achieves state-of-the-art performance in all experiments.
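As a concrete illustration of the dense temporal structure in (3), the sketch below stacks 1-D temporal convolutions so that the k-th layer consumes the concatenated outputs of the input and all preceding layers, and the per-frame scores are trained with a CTC loss. The layer count, channel sizes, vocabulary size, and blank index are assumptions for illustration only, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTCN(nn.Module):
    """Sketch of a dense temporal convolution stack: the k-th TC layer
    receives the concatenation of the input and all previous TC outputs,
    so deeper layers cover an increasingly large temporal receptive field."""

    def __init__(self, in_dim=512, growth=256, num_layers=4, vocab_size=1000):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_dim
        for _ in range(num_layers):
            self.layers.append(
                nn.Conv1d(channels, growth, kernel_size=3, padding=1))
            channels += growth                          # dense concatenation grows the input
        self.classifier = nn.Linear(channels, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, x):
        # x: (batch, time, in_dim) clip-level features
        feats = [x.transpose(1, 2)]                     # keep (batch, channels, time)
        for layer in self.layers:
            dense_in = torch.cat(feats, dim=1)          # outputs of all preceding layers
            feats.append(F.relu(layer(dense_in)))
        out = torch.cat(feats, dim=1).transpose(1, 2)   # (batch, time, channels)
        return self.classifier(out).log_softmax(dim=-1) # per-frame gloss log-probabilities

# Illustrative CTC training step (shapes assumed):
# logp = model(clips)                                   # (B, T, vocab_size + 1)
# loss = nn.CTCLoss(blank=1000)(logp.transpose(0, 1), targets,
#                               input_lengths, target_lengths)
```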
Keywords/Search Tags:deep learning, cross-modality media analysis, cross-modal retrieval, video translation, few-shot learning