Research On Deep Multi-modal Clustering Algorithms

Posted on: 2023-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: G Z Ke
Full Text: PDF
GTID: 2568306791492554
Subject: Systems Engineering
Abstract/Summary:
Multi-modal learning is a machine learning paradigm that extracts high-quality representations from multi-source data, and it is widely used in video search, human-computer interaction, and sentiment analysis. For example, in video search, the video content and the search keywords together form multi-modal data. A high-quality multi-modal representation can improve both the efficiency and the accuracy of video retrieval. Over the past two decades, multi-modal algorithms based on deep learning have made great progress. However, most of this success depends on large amounts of annotated data. As practical application environments keep changing, high-quality annotated data has become one of the bottlenecks in the development of multi-modal learning. How to train multi-modal models with unlabeled data has therefore become a key research problem in the field. Clustering is a classic unsupervised learning method. It divides samples into several groups according to the distances between their features in the feature space, so that samples within a cluster are similar while samples in different clusters are dissimilar. Clustering is one of the promising directions for addressing this data hunger. This thesis therefore combines multi-modal learning with clustering and studies two issues: how to improve the training efficiency of multi-modal clustering algorithms, and how to improve the quality of multi-modal representations. To this end, we propose two deep multi-modal clustering methods.

(1) A multi-modal clustering algorithm based on generated pseudo-labels. Most multi-modal clustering algorithms rely on a reconstruction loss, which is prone to trivial solutions during training and therefore makes network training inefficient. Through case analysis, we find that this inefficiency arises because the reconstruction loss provides only a weak training signal. We therefore introduce a pseudo-label mechanism as the training target of the network. Compared with reconstruction, pseudo-labels provide a more effective training signal and thereby reduce the tendency of the network to converge to trivial solutions. Specifically, we design two modules that are trained alternately, a “correction module” and an “approximation module”. The correction module trains the network with the reconstruction loss so that the quality of the generated pseudo-labels does not become too low; the approximation module uses the pseudo-labels to train the multi-modal representation extractor. The proposed method achieves better clustering performance on 4 public datasets; on the multi-modal handwritten digit recognition dataset, its clustering accuracy is 10.3% higher than that of the second-best method.

(2) A multi-modal clustering algorithm based on contrastive learning. Although the reconstruction loss makes unsupervised multi-modal algorithms work, it also forces the model to attend to redundant semantic information, which acts as noise in the clustering task. To address this, we design a fusion method based on contrastive learning that reduces the noise introduced during feature fusion in multi-modal clustering models. Specifically, we analyze the basic concept of multi-modal fusion from the perspective of information bottleneck theory and find that this noise is caused by the model’s inability to identify the modality-specific and the consistent representations. We therefore design a contrastive fusion module that extracts task-relevant information from the modality features by maximizing mutual information and discards task-irrelevant information from the multi-source data. The proposed method achieves the best results on 5 public multi-modal datasets; on the multi-modal handwritten digit recognition dataset, its clustering accuracy (ACC) is 18.3% higher than that of method (1).
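
The abstract does not say which contrastive objective the fusion module uses. As a rough illustration of how a contrastive loss can tie mutual-information maximization to paired modalities, the sketch below implements InfoNCE, a commonly used contrastive objective, in plain Python. The function names `cosine` and `info_nce`, the temperature of 0.5, and the toy 2-D feature vectors are all assumptions made for this example, not details taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss; view_a[i] and view_b[i] are treated as a positive pair.

    Every other view_b[j] (j != i) serves as a negative for view_a[i].
    The temperature value is an illustrative assumption.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the true partner.
        total += log_denominator - logits[i]
    return total / n


# Hypothetical features of three samples seen through two modalities.
modality_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
modality_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]

# Break the pairing by rotating the second modality by one position.
shuffled_b = [modality_b[1], modality_b[2], modality_b[0]]

loss_aligned = info_nce(modality_a, modality_b)
loss_shuffled = info_nce(modality_a, shuffled_b)
print(loss_aligned < loss_shuffled)
```

The loss is lower when the two modalities of the same sample agree than when the pairing is broken, which is the property a contrastive fusion step would exploit: information shared across modalities is rewarded, while information that does not help identify the matching sample contributes nothing to the objective.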
Keywords/Search Tags: Clustering, Multi-modal Learning, Contrastive Learning, Feature Fusion