Task-driven Visual Media Text Description Technology

Posted on: 2020-06-06
Degree: Master
Type: Thesis
Country: China
Candidate: M Gao
GTID: 2438330575959491
Subject: Communication and Information System
Abstract/Summary:
With the exponential growth of personal data, the amount of image and video data that people collect is also increasing. Rather than text alone, people now widely combine text with images or videos to record their lives. However, because the volume of image and video data is so large, people cannot quickly and accurately find the images or video clips they are interested in when uploading them to social software. To meet this need, this thesis proposes a cross-modal video diary retrieval method based on a video captioning model. By analyzing video content and automatically generating natural language descriptions, the method realizes cross-modal conversion between video and text, which helps people retrieve the video clips they need from a huge video database. In addition, to address the effect of image resolution on image captioning, this thesis proposes an improved image super-resolution reconstruction algorithm based on a cascaded residual learning convolutional neural network, and applies the super-resolved images to image captioning to improve captioning accuracy.

1) A retrieval algorithm for text and video diaries based on video captioning is proposed. It consists of three processes.

Video shot segmentation. A shot segmentation method based on the wavelet transform segments video adaptively and detects shot boundaries well, so it is the method used in this thesis. First, the brightness difference between video frames is decomposed by wavelet multi-resolution analysis; then the modulus maxima points are obtained after denoising; finally, the shot boundaries are found by tracking the modulus maxima points. The video is thus segmented into short clips, each containing a different scene.

Video captioning. This thesis uses a caption-guided visual saliency method for automatic description. The method exposes the mapping between image regions and words that modern encoder-decoder networks learn implicitly from caption training data, and it can generate temporal or spatial heatmaps for predicted captions or for arbitrary query sentences.

Vector representation of text. To represent video descriptions and diary descriptions as fixed-length vectors, this thesis uses an unsupervised algorithm that learns fixed-length feature representations from variable-length text. This overcomes two weaknesses of the bag-of-words model: its loss of word order and its lack of semantic information.

2) An improved image super-resolution reconstruction algorithm based on a cascaded residual learning convolutional neural network is proposed. When existing convolutional neural network (CNN) based methods restore a high-resolution image from a low-resolution one, some high-frequency components cannot be recovered. In the proposed method, the sum of the high-resolution image restored by the first residual learning network and the residual estimation image is taken as the input of the second residual learning network, which learns again the residual components that were not recovered. Image super-resolution is then applied to image captioning to improve captioning accuracy.

In this thesis, the video captioning model is applied to the practical problem of retrieving a user's video diary from the user's text diary. A survey of participants' satisfaction with the matched video diaries found that most participants were satisfied. The image super-resolution method is also applied to image captioning, and the results show that when the image resolution is low, increasing the resolution clearly improves the accuracy of the generated image descriptions.
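The shot segmentation steps described in the abstract (brightness difference, multi-scale wavelet decomposition, modulus maxima, tracking across scales) can be illustrated with a minimal sketch. It is not the thesis implementation: the Mexican-hat wavelet, the scales `(1, 2, 4)`, the relative threshold used here as a stand-in for the denoising step, the matching tolerance, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def brightness_difference(frames):
    """Absolute change in mean brightness between consecutive frames."""
    mean_brightness = frames.reshape(len(frames), -1).mean(axis=1)
    return np.abs(np.diff(mean_brightness))


def mexican_hat_transform(signal, scale):
    """Wavelet coefficients of `signal` at one scale (Mexican-hat wavelet, assumed)."""
    n = np.arange(len(signal))
    t = (n[:, None] - n[None, :]) / scale
    kernel = (1.0 - t**2) * np.exp(-0.5 * t**2)
    return kernel @ signal


def modulus_maxima(coeffs, rel_threshold):
    """Interior local maxima of |coeffs| that reach a fraction of the largest modulus.

    The relative threshold is a simple stand-in for a denoising step.
    """
    mod = np.abs(coeffs)
    interior = np.arange(1, len(mod) - 1)
    is_peak = (mod[1:-1] > mod[:-2]) & (mod[1:-1] > mod[2:])
    is_strong = mod[1:-1] >= rel_threshold * mod.max()
    return interior[is_peak & is_strong]


def detect_shot_boundaries(frames, scales=(1, 2, 4), rel_threshold=0.3, tolerance=1):
    """Return indices of frames that start a new shot.

    A maximum found at the finest scale is kept only if it can be followed,
    within `tolerance` samples per step, through every coarser scale.
    """
    diff = brightness_difference(frames)
    per_scale = [
        modulus_maxima(mexican_hat_transform(diff, s), rel_threshold) for s in scales
    ]
    # Each track remembers where it started (finest scale) and where it is now.
    tracks = [(int(p), int(p)) for p in per_scale[0]]
    for maxima in per_scale[1:]:
        survivors = []
        for start, current in tracks:
            if len(maxima) == 0:
                continue
            nearest = int(maxima[np.argmin(np.abs(maxima - current))])
            if abs(nearest - current) <= tolerance:
                survivors.append((start, nearest))
        tracks = survivors
    # diff[i] compares frame i with frame i + 1, so the new shot starts at i + 1.
    return sorted({start + 1 for start, _ in tracks})
```

Tracking across scales is what separates real boundaries from wavelet side lobes in this sketch: a genuine brightness jump keeps its position as the scale grows, while a side lobe moves away from the jump and loses its match.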
Keywords/Search Tags:Video Segmentation, Video Captioning, Vector Representation of Text, Lifelog Video, Super-Resolution