
Research On Multimodal Data Modeling And Retrieval For Common Space Learning

Posted on: 2019-10-15  Degree: Master  Type: Thesis
Country: China  Candidate: S J Chen  Full Text: PDF
GTID: 2428330572455862  Subject: Communication and Information System
Abstract/Summary:
With the growth of big data, multimedia data such as text, images, and speech has reached a large scale in both volume and diversity, and the modeling and retrieval of such data have attracted increasing attention; in other words, research on the multiple modalities present in multimedia data has become a hot topic. To cope with the heterogeneous structures, information complexity, and task mismatch of multimodal data, multimodal data fusion plays a key role: it integrates the information contained in the data into a task-oriented unified representation. Common space learning is the main approach to constructing such a unified representation; it models the modalities present in the data and learns a latent space shared by multiple inputs, supporting tasks such as data retrieval, object localization, and learning from imbalanced data. In this thesis, deep neural networks and deep learning are the main tools for common space learning.

First, for the modeling of multimodal data, we propose a Fine-grained Progressive Attention Network (FPAN) for the retrieval and localization of image data. FPAN focuses on modeling image data with deep learning and addresses the key problem of target object localization; solving this problem alleviates the difficulty of information interaction between dense modal data and advances the use of deep learning in common space learning. FPAN uses a fully convolutional network, fine-grained "soft" attention, and cascaded upsampling as its basic modules to directly process the query (target) image and the image to be searched, so that the target object can be localized accurately on the image. At the same time, it converts dense multimodal data into localization information that can be retrieved effectively.

Second, we study the data imbalance that often arises in multimodal data modeling and propose a hybrid sampling algorithm based on multi-information fusion (MIFS). Previous sampling algorithms rely on a single kind of information, so a large number of actually harmful samples are used for sampling and the discriminability of the resampled data drops sharply. MIFS combines the feature information learned by a boosting model with the position information of the data to characterize each sample, divides the samples into different subsets according to this information, and then performs the corresponding under-sampling or over-sampling on each subset. The dataset balanced in this way retains the information of the original dataset while adding valid samples, addressing the key problem of data imbalance.
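The abstract describes FPAN only at the module level (a fully convolutional backbone, fine-grained soft attention, and cascaded upsampling). The sketch below is a minimal, hypothetical PyTorch illustration of that kind of pipeline; the class name AttentionLocalizer, the layer sizes, and the similarity-based attention are assumptions made for illustration, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class AttentionLocalizer(nn.Module):
    """Hypothetical FPAN-style pipeline: a shared fully convolutional backbone
    encodes both the query image and the image to be searched, a soft attention
    map is computed from their feature similarity, and cascaded upsampling brings
    the map back to input resolution so the target object can be localized."""
    def __init__(self, feat_dim=64):
        super().__init__()
        # Shared fully convolutional backbone (stand-in for the real one).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Cascaded upsampling back to the input resolution.
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(1, 1, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(1, 1, 4, stride=2, padding=1),
        )

    def forward(self, query_img, search_img):
        q = self.backbone(query_img)           # (B, C, hq, wq)
        s = self.backbone(search_img)          # (B, C, hs, ws)
        q_vec = q.mean(dim=(2, 3))             # pool the query into one descriptor
        # Soft attention: similarity of every search-image location to the query.
        att = (s * q_vec[:, :, None, None]).sum(dim=1, keepdim=True)
        att = torch.softmax(att.flatten(2), dim=-1).view_as(att)
        return torch.sigmoid(self.upsample(att))   # coarse localization map

loc = AttentionLocalizer()
heatmap = loc(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(heatmap.shape)  # torch.Size([1, 1, 64, 64])
```

A MIFS-style hybrid sampler can be sketched in the same spirit. The following toy version assumes scikit-learn's GradientBoostingClassifier as the boosting model and a k-nearest-neighbour query as the source of position information; the confidence thresholds and the "noisy"/"informative" subset definitions are invented for illustration and are not the thesis's exact rules.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import NearestNeighbors

def mifs_style_resample(X, y, k=5, random_state=0):
    """Toy hybrid sampler: fuse feature information from a boosting model with
    position information from local neighbourhoods, then under-sample harmful
    majority samples and over-sample informative minority samples.
    X: (n, d) float array; y: (n,) array of 0/1 labels."""
    minority = 1 if (y == 1).sum() < (y == 0).sum() else 0

    # Feature information: how confidently a boosting model scores each sample.
    booster = GradientBoostingClassifier(random_state=random_state).fit(X, y)
    conf = booster.predict_proba(X)[np.arange(len(y)), y]

    # Position information: fraction of same-class points among the k neighbours.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    same = (y[idx[:, 1:]] == y[:, None]).mean(axis=1)

    # Fuse both sources: "noisy" = low confidence and isolated,
    # "informative" = borderline minority samples worth replicating.
    noisy = (conf < 0.5) & (same < 0.3)
    informative = (y == minority) & (same >= 0.3) & (same <= 0.7)

    keep = ~((y != minority) & noisy)              # drop harmful majority points
    X_res, y_res = X[keep], y[keep]
    X_add, y_add = X[informative], y[informative]  # replicate borderline minority
    reps = max(1, ((y_res == 1 - minority).sum() - (y_res == minority).sum())
               // max(1, len(y_add)))
    return (np.vstack([X_res] + [X_add] * reps),
            np.concatenate([y_res] + [y_add] * reps))
```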
Finally, this thesis presents a deep learning model for fine-grained common space learning that can be used for multimodal data retrieval. It addresses two problems: 1) traditional common space learning loses fine-grained information between modalities; 2) the generation of corresponding local information and the learning of the common space are treated as two separate stages, which makes it difficult to optimize the whole model jointly. Specifically, we propose a multimodal LSTM with an attention alignment mechanism, the Attention Alignment Multimodal LSTM (AAM-LSTM), which consists of an Attentional Alignment Recurrent network (AA-R) and a Hierarchical Multimodal LSTM (HM-LSTM). Unlike traditional methods that operate directly on the full modal data, the proposed model exploits the inter-modal and intra-modal semantic relationships of local information to jointly establish a uniform representation of multimodal data. AA-R automatically captures semantically aligned local information to learn the common subspace without supervised labels, and HM-LSTM then leverages the latent relationships among these local features to learn a fine-grained common space. Experiments show that the fine-grained common space learned by AAM-LSTM effectively improves retrieval accuracy.
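As a rough illustration of attention-aligned common space learning, the toy sketch below projects image region features and word features into one space, aligns each modality's local features against the other with soft attention, summarizes the aligned sequences with an LSTM, and compares the resulting vectors with cosine similarity. All names and dimensions (AttentionAlignedEncoder, img_dim=2048, word_dim=300, common_dim=256) are assumptions; the sketch does not reproduce AA-R or HM-LSTM themselves.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAlignedEncoder(nn.Module):
    """Toy attention-aligned common-space encoder: image regions attend over
    the words of a sentence (and vice versa), an LSTM summarizes the aligned
    local features, and both modalities are projected into one shared space."""
    def __init__(self, img_dim=2048, word_dim=300, common_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)
        self.txt_proj = nn.Linear(word_dim, common_dim)
        self.lstm = nn.LSTM(common_dim, common_dim, batch_first=True)

    def align(self, queries, keys):
        # Soft attention alignment: each query aggregates the keys it matches.
        att = torch.softmax(queries @ keys.transpose(1, 2), dim=-1)
        return att @ keys

    def forward(self, regions, words):
        r = self.img_proj(regions)         # (B, num_regions, D)
        w = self.txt_proj(words)           # (B, num_words, D)
        r_aligned = self.align(r, w)       # regions described by matching words
        w_aligned = self.align(w, r)       # words grounded in matching regions
        # The LSTM over aligned local features yields one vector per modality.
        img_vec = self.lstm(r_aligned)[0][:, -1]
        txt_vec = self.lstm(w_aligned)[0][:, -1]
        return F.normalize(img_vec, dim=-1), F.normalize(txt_vec, dim=-1)

enc = AttentionAlignedEncoder()
img, txt = enc(torch.randn(2, 36, 2048), torch.randn(2, 12, 300))
print(torch.cosine_similarity(img, txt))   # retrieval score per image-text pair
```

In a retrieval setting, such an encoder would typically be trained with a ranking or triplet loss so that matching image-text pairs score higher than mismatched ones.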
Keywords/Search Tags: Multi-modal data fusion, Multi-modal data model, Common space learning, Progressive attention network, Multi-information fusion sampling, Attention alignment multi-modal LSTM