
Multi-view Neural Network Learning Approaches For Cross-modal Retrieval And Classification

Posted on: 2022-08-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Wang
Full Text: PDF
GTID: 1488306551969969
Subject: Computer Science and Technology

Abstract/Summary:
Owing to the rapid development of Internet multimedia technology, the widespread use of smartphones, and the growing popularity of social networks, people can share content on the Internet anytime and anywhere, so that multimedia data of all kinds (text, images, videos, etc.) exhibits explosive growth and massive agglomeration. Data at this scale marks the arrival of the era of multimedia big data, and it brings new opportunities as well as challenges to research and applications based on multimodal learning.

With the rapid development of artificial intelligence technology represented by deep neural networks, how to simulate the cognitive and understanding processes of the human brain with deep neural networks, so as to realize semantic connection and content understanding across multimodal data, is becoming a research hotspot in artificial intelligence and multimedia, and it remains a key open problem in multimodal learning. This research problem is known as Cross-modal Content Understanding (CCU). CCU is of great significance and value to national, social, and personal life.

Cross-modal retrieval and classification, the practical research tasks within CCU, aim to retrieve and classify semantically related data of all modalities from a database when the user supplies query data of any single modality. However, because different modalities differ greatly in how their data are organized and structured (the heterogeneous gap), semantic similarities across modalities are hard to measure directly, and cross-modal retrieval and classification therefore face great challenges. CCU also presents different characteristics and difficulties under different supervision modes. To address the heterogeneous gap, this thesis studies CCU under different supervision modes and focuses on the following four
scientific issues in cross-modal retrieval and classification:

· Issue 1: When multimodal data are labeled, how can the errors caused by a mismatch between the retrieval metric and the loss function be avoided, while the information from different views is exploited as evenly as possible?
· Issue 2: When multimodal data are labeled, how can the time and space limitations caused by coupled training in multi-view hash learning be overcome?
· Issue 3: Given a small amount of labeled data and a large amount of unlabeled data, how can semi-supervised learning make better use of the unlabeled data to improve the accuracy and robustness of cross-modal content understanding?
· Issue 4: When multimodal data are unlabeled, how can unsupervised learning be used to understand cross-modal content and narrow the heterogeneous gap between modalities?

To address these issues, this thesis designs and proposes a series of multi-view neural network learning approaches under different supervision modes. The research results are summarized as follows:

1. A new approach called Deep Relational Similarity Learning (DRSL) is proposed. Unlike existing approaches, DRSL directly learns a pairwise relational similarity matrix instead of explicitly learning a shared space, and thus avoids the issue of unbalanced information across modalities. Moreover, the pairwise relational similarity itself serves as the cross-modal retrieval metric, so no extra error is introduced by a mismatch between the loss function and the retrieval metric. Extensive experiments on four public datasets show the promising performance of DRSL on the cross-modal retrieval task.

2. A novel approach named Separated Variational Hashing Networks (SVHNs) is designed to separately transform any number of modalities into a common Hamming space. SVHNs consists of a label
network (LabNet) and multiple modality-specific networks. LabNet exploits all available label annotations to learn a latent common Hamming space by projecting the semantic labels into common binary codes. The modality-specific variational networks then separately project the modalities into the common semantic binary representations learned by LabNet; this is realized by variational inference that matches the aggregated posterior of LabNet's hash-code vector with any prior distribution. Extensive experimental results on four widely used benchmark datasets, together with comprehensive analysis, demonstrate the effectiveness of LabNet and the variational inference, leading to superior cross-modal retrieval performance as well as more efficient and flexible training compared to current state-of-the-art methods.

3. A new approach dubbed Deep Semi-supervised Class- and Correlation-Collapsed Cross-View Learning (DSC3L) is proposed. DSC3L learns a discriminative space by collapsing instances of the same class into the same point while simultaneously collapsing instances of different classes into different points. Meanwhile, to fully exploit the unlabeled data, DSC3L models the correlation of unlabeled data by collapsing correlated samples into the same point and uncorrelated samples into other points. The two objectives are jointly optimized by minimizing two Kullback-Leibler divergences. Furthermore, DSC3L can be applied to more than two views. Extensive experiments on five benchmark datasets show that DSC3L performs well on cross-view retrieval and classification tasks.

4. A novel unsupervised multi-view representation learning approach called Adversarial Correlated Autoencoder (AdvCAE) is designed to learn common representations for multi-view data. The approach conducts variational inference by matching the aggregated posteriors of the latent variables with a specific prior
distribution. Benefiting from this construction, the representations of different views follow the same distribution. Comprehensive experiments on five benchmark datasets show the promising performance of the proposed method on cross-view classification and cross-view retrieval tasks.
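The retrieval mechanics shared by the hashing approach above (SVHNs) can be illustrated with a minimal sketch: once every modality is mapped into a common Hamming space, cross-modal retrieval reduces to ranking database codes by Hamming distance to the query code. The random projection matrices below are hypothetical stand-ins for the trained modality-specific networks, not the thesis's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

def hash_codes(features, projection):
    """Map real-valued features to binary codes via the sign of a linear projection."""
    return (features @ projection > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Return database indices sorted by ascending Hamming distance to the query."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable"), dists

# Toy stand-ins: 512-d image features, 300-d text features, 32-bit codes.
img_proj = rng.standard_normal((512, 32))  # stand-in for the image network
txt_proj = rng.standard_normal((300, 32))  # stand-in for the text network

db_images = rng.standard_normal((100, 512))
db_codes = hash_codes(db_images, img_proj)

# Text query against the image database: encode, then rank by Hamming distance.
query_text = rng.standard_normal(300)
query_code = hash_codes(query_text[None, :], txt_proj)[0]

ranking, dists = hamming_rank(query_code, db_codes)
print("top-5 neighbours:", ranking[:5])
```

Because each modality is encoded independently, a new modality only requires training one additional encoder against the shared codes, which is the source of the flexible, decoupled training the thesis highlights.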
Keywords/Search Tags: Multi-view neural network, Multi-view learning, Representation learning, Deep learning, Common space, Cross-modal retrieval and classification, Cross-modal content understanding