
Bridging Vision And Language Together Based On Computer Vision

Posted on: 2019-07-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Xu
Full Text: PDF
GTID: 1318330542994131
Subject: Control Science and Engineering
Abstract/Summary:
With the rapidly increasing number of images and videos on the web, helping users grasp multimedia resources more easily has become a pressing problem. For users, the most straightforward way to understand an image or a video is to translate it into language. For computers, only when a machine can describe in natural language what is happening in the media, or answer questions about an image, can we say it truly understands the whole content of the image or video. Bridging vision and language is therefore a meaningful problem in computer vision, and more and more researchers are doing great work in this area. Although progress between computer vision and language is promising, some problems are still not solved well. This thesis investigates the key problems between computer vision and language from different aspects.

First of all, generating descriptions from videos or images is one of our crucial problems. This task currently faces several limits: the existing data cannot support the task well in this domain, and current approaches cannot fully exploit the image or video structure to suit the problem. Besides, we are also interested in whether a computer can answer specific questions about a given image, and in what the most important clues for solving this problem are. Finally, how to present the best joint representation of language and vision to users is our final question. Based on these observations, the thesis conducts comprehensive research on vision and language analysis and achieves the following progress:

1. We introduce a new dataset for describing videos with natural language. Using over 3,400 worker hours, a vast collection of video-sentence pairs was collected, annotated, and organized to drive the advancement of video-to-text algorithms. The dataset contains representative videos covering a wide variety of categories and, to date, the largest number of sentences. We comprehensively evaluated RNN-based approaches with variant components on related datasets and on ours. Before this thesis was finished, the dataset had been used over 100 times worldwide and cited over 120 times.

2. To better exploit the video structure between vision and language, we propose a novel deep framework that boosts video captioning by learning a Multimodal Attention LSTM (MA-LSTM) model. The proposed MA-LSTM fully exploits both multimodal streams and two-layer attention to selectively focus on specific elements during sentence generation. In particular, a child-sum fusion unit is proposed to elegantly combine the different streams (an illustrative sketch follows this list).

3. With frame-region features and attributes in images, we add Faster R-CNN components to image/video captioning and visual question answering. By including object and attribute information, the system learns much more about what is in the image and reaches a better understanding (see the region-feature sketch after this list).

4. To achieve a better combination of language and vision, we build a vivid storyboard. Specifically, interesting events are detected from search logs and representative images are selected to form the storyboard. A real application has been built to demonstrate this combination.
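To make contribution 2 concrete, the following is a minimal Python (PyTorch) sketch of the two ingredients named above: a soft-attention module that selectively weights per-frame features during sentence generation, and a child-sum style fusion unit that combines the states of two modality streams. All layer names and sizes are assumptions for illustration; this is not the thesis implementation.

```python
# Illustrative sketch only: soft attention over frame features plus a
# child-sum style fusion of two per-modality LSTM states (e.g. appearance
# and motion). Layer names/sizes are assumptions, not the thesis code.
import torch
import torch.nn as nn


class SoftAttention(nn.Module):
    """Weight per-frame features by relevance to the decoder state."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.proj_h = nn.Linear(hidden_dim, attn_dim)
        self.proj_f = nn.Linear(feat_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, h, feats):  # h: (B, hidden), feats: (B, T, feat_dim)
        e = self.score(torch.tanh(self.proj_h(h).unsqueeze(1) + self.proj_f(feats)))
        a = torch.softmax(e, dim=1)       # attention weights over the T frames
        return (a * feats).sum(dim=1)     # attended feature, shape (B, feat_dim)


class ChildSumFusion(nn.Module):
    """Merge two modality LSTM states: gate each cell state with its own
    forget gate, then sum, in the spirit of the child-sum Tree-LSTM."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.forget_a = nn.Linear(2 * hidden_dim, hidden_dim)
        self.forget_b = nn.Linear(2 * hidden_dim, hidden_dim)
        self.out_gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, h_a, c_a, h_b, c_b):
        h_cat = torch.cat([h_a, h_b], dim=-1)
        c = (torch.sigmoid(self.forget_a(h_cat)) * c_a
             + torch.sigmoid(self.forget_b(h_cat)) * c_b)
        h = torch.sigmoid(self.out_gate(h_cat)) * torch.tanh(c)
        return h, c
```

At each decoding step, each modality stream would run its own attention and LSTM update, and the fusion unit would merge the two resulting states before the next word is predicted.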
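For contribution 3, the sketch below shows one common way to obtain region-level clues: run an off-the-shelf Faster R-CNN to detect boxes, then pool a feature vector per box from a CNN feature map. The detector, backbone, and pooling choices here are assumptions for illustration and may differ from the system described in the thesis.

```python
# Hedged sketch: per-region features from Faster R-CNN detections.
# Requires torchvision >= 0.13 for the `weights="DEFAULT"` argument.
import torch
import torchvision
from torchvision.ops import roi_align

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

# Plain ResNet-50 trunk (up to the last conv block, output stride 32),
# used here as the feature-map source for region pooling.
backbone = torchvision.models.resnet50(weights="DEFAULT")
trunk = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()


@torch.no_grad()
def region_features(image: torch.Tensor, top_k: int = 36):
    """image: (3, H, W) float tensor in [0, 1]. Returns (boxes, feats)."""
    det = detector([image])[0]                   # dict with boxes/labels/scores
    boxes = det["boxes"][:top_k]                 # keep the highest-scoring boxes
    fmap = trunk(image.unsqueeze(0))             # (1, 2048, H/32, W/32)
    pooled = roi_align(fmap, [boxes], output_size=(7, 7), spatial_scale=1 / 32)
    feats = pooled.mean(dim=(2, 3))              # one 2048-d vector per region
    return boxes, feats
```

The resulting per-region vectors can then serve as the "objects and attributes" input that a captioning decoder or a VQA model attends over.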
Keywords/Search Tags: deep learning, video, image, language, CNN, RNN, multi-modality