
High-level Semantics Based Cross-modality Applications

Posted on: 2019-01-18    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y W Pan    Full Text: PDF
GTID: 1318330542997988    Subject: Information and Communication Engineering

Abstract/Summary:
The rapid development of Web 2.0 technologies has led to a surge of research activity in the multimedia field. In this multimedia era, we spontaneously create and share multimedia materials (e.g., images and videos) in our daily lives. During this process, users are not content to merely deliver the visual information (image/video); they often also share their understanding of the image/video content from a semantic perspective, i.e., the corresponding textual information (e.g., tags, phrases, or descriptions). This conjunction of visual and textual information enables many cross-modality applications in multimedia, including retrieving related images from keywords (keyword-based image search), generating sentences from videos (video captioning), and more ambitious tasks such as directly generating video content from captions. The essence of these cross-modality applications is the transformation between visual and textual information. However, most existing approaches rely heavily on cross-view learning and deep learning techniques simply to enable the transformation across modalities, while the high-level semantics shared by the different modalities are not fully exploited. Emphasizing high-level semantics in the cross-modal transformation is crucial for understanding the visual content and ultimately for improving the quality of the transformation.

To address this issue, this thesis starts from the high-level semantics that inherently lie between visual and textual information, and studies how to leverage them to facilitate and boost cross-modality applications, namely keyword-based image search (text-to-image), video captioning (video-to-text), and video generation from captions (text-to-video). In summary, this thesis makes the following contributions:

(1) We propose a high-level semantics based method for keyword-based image search that exploits the semantic relationship between textual queries and images mined from click-through data in search-engine logs. This relationship is leveraged to build a latent subspace in which the similarity between a textual query and an image can be measured directly. The latent subspace is learnt by jointly minimizing the distance between observed query-image pairs and preserving the inherent structure of each original single view. Experiments on a large-scale click-based image dataset show that our method improves over a Support Vector Machine based method by 4.0% in terms of relevance.

(2) We propose an implicit high-level semantics based video captioning model for describing videos with coherent and relevant sentences. The model jointly learns an LSTM and a visual-semantic embedding: the LSTM captures the contextual relationships among the words in a sentence, while the visual-semantic embedding captures the implicit relationship between the semantics of the entire sentence and the video content, so that natural language can be generated for a given video. Extensive experiments on three video captioning benchmarks verify the effectiveness of the model; in particular, on the YouTube2Text dataset our method improves over an LSTM based method by 4.7% in terms of METEOR.
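To make the kind of joint objective described in contribution (2) concrete, the following is a minimal PyTorch sketch of an LSTM caption decoder trained together with a visual-semantic embedding, where a bidirectional ranking loss pulls matching video and sentence embeddings together in a shared space. The module names, dimensions, margin, and loss weight are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointCaptioner(nn.Module):
    """LSTM caption decoder plus a shared visual-semantic embedding space."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.visual_embed = nn.Linear(feat_dim, embed_dim)   # video feature -> shared space
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feat, captions):
        # Decode the caption conditioned on the video: the projected video feature
        # is fed as the first "token", followed by the shifted word embeddings.
        v = self.visual_embed(video_feat)                    # (B, E)
        w = self.word_embed(captions[:, :-1])                # (B, T-1, E)
        hidden, _ = self.lstm(torch.cat([v.unsqueeze(1), w], dim=1))
        logits = self.out(hidden)                            # (B, T, vocab)

        # Sentence embedding: mean of word embeddings (a simplifying assumption),
        # normalized so the dot products below are cosine similarities.
        s = self.word_embed(captions).mean(dim=1)            # (B, E)
        return logits, F.normalize(v, dim=-1), F.normalize(s, dim=-1)


def joint_loss(logits, captions, v, s, margin=0.2, alpha=1.0):
    # (a) word-level cross-entropy for the LSTM decoder.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
    # (b) bidirectional hinge ranking loss on the visual-semantic embedding:
    # matched (video, sentence) pairs should score higher than mismatched ones.
    sim = v @ s.t()                                          # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                            # matched-pair similarities
    off_diag = (~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)).float()
    cost_s = (F.relu(margin - pos + sim) * off_diag).mean()      # rank sentences per video
    cost_v = (F.relu(margin - pos.t() + sim) * off_diag).mean()  # rank videos per sentence
    return ce + alpha * (cost_s + cost_v)
```

In practice the sentence embedding would typically come from a dedicated sentence encoder rather than a mean of word embeddings; the mean is used here only to keep the sketch self-contained.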
(3) Besides leveraging the implicit relationship between the sentence and the video content, we propose an explicit high-level semantics based approach to further boost video captioning. We first mine explicit semantics in videos (i.e., semantic attributes) and then dynamically incorporate them into a conventional RNN based architecture to strengthen the semantic relationship between the generated sentence and the video content. Experiments on the same three video captioning datasets validate our explicit semantics based captioning model, with clear performance improvements over other captioning techniques. Notably, our proposed LSTM-TSA achieves, to date, the best published sentence-generation performance on MSVD: 52.8% and 74.0% in terms of BLEU@4 and CIDEr-D, respectively.

(4) We propose a high-level semantics based video generation method that enables the transformation from sentences to videos. The model exploits both semantic and temporal coherence in designing Generative Adversarial Networks (GANs) to generate videos consisting of visually coherent and semantically consistent frames. We demonstrate the capability of our model to generate plausible videos conditioned on given captions on two synthetic datasets and one real-world dataset.
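As a companion illustration for contribution (4), here is a minimal PyTorch sketch of caption-conditioned video generation with an explicit temporal-coherence penalty: a generator maps noise plus a caption embedding to a short clip, a discriminator scores (clip, caption) pairs, and the generator loss additionally penalizes large jumps between neighbouring frames. Network shapes, module names, and the loss weight are assumptions for illustration, not the architecture used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoGenerator(nn.Module):
    """Noise + caption embedding -> a short clip of T flattened frames."""

    def __init__(self, z_dim=100, text_dim=256, frames=16, frame_pixels=64 * 64 * 3):
        super().__init__()
        self.frames, self.frame_pixels = frames, frame_pixels
        self.net = nn.Sequential(
            nn.Linear(z_dim + text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, frames * frame_pixels), nn.Tanh(),
        )

    def forward(self, z, text_emb):
        x = self.net(torch.cat([z, text_emb], dim=1))
        return x.view(-1, self.frames, self.frame_pixels)    # (B, T, pixels)


class VideoDiscriminator(nn.Module):
    """Scores whether a (clip, caption) pair looks real and matched."""

    def __init__(self, text_dim=256, frames=16, frame_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frames * frame_pixels + text_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),
        )

    def forward(self, video, text_emb):
        return self.net(torch.cat([video.flatten(1), text_emb], dim=1))  # real/fake logit


def generator_loss(D, fake_video, text_emb, lambda_temporal=0.1):
    # Adversarial term: fool the caption-conditioned discriminator.
    target = torch.ones(fake_video.size(0), 1, device=fake_video.device)
    adv = F.binary_cross_entropy_with_logits(D(fake_video, text_emb), target)
    # Temporal-coherence term: penalize large differences between adjacent frames.
    temporal = (fake_video[:, 1:] - fake_video[:, :-1]).pow(2).mean()
    return adv + lambda_temporal * temporal
```

A real implementation would use 3D or recurrent convolutional generators and discriminators, plus matched/mismatched caption pairs for the discriminator; the fully connected layers here only keep the sketch short.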
Keywords/Search Tags:Image Search, Cross-view Learning, Convolutional Neural Networks, Recurrent Neural Networks, Video Captioning, Semantic Attributes, Generative Adversarial Networks, Video Generation