
Research On Some Problems Of Visual Semantic Understanding

Posted on: 2022-01-17    Degree: Doctor    Type: Dissertation
Country: China    Candidate: F Y Zhu    Full Text: PDF
GTID: 1488306326480364    Subject: Information and Communication Engineering
Abstract/Summary:
Visual semantic understanding, one of the most important tasks in computer vision, poses an intriguing challenge spanning vision and natural language. It explores the vision-language relationship in a way that resembles human visual perception. As research has developed, how to understand visual semantic information from small samples has become an urgent problem. Meanwhile, how to capture visual semantic information in complex environments, where multiple active objects exist, has also become a pressing issue. Regarding the above problems, the foundations of this research are as follows:

1. Small-sample image classification: Multi-modal learning, one of the early approaches to small-sample classification, uses object-level text semantic information to facilitate understanding of the high-level holistic semantic information. These approaches are built on traditional machine learning, so their performance cannot exceed that of some deep learning based models. However, deep learning models rely on training with large amounts of data and therefore perform poorly on small datasets. To improve deep learning based models under small samples, some works build neural topic models to represent the local semantic information, but these methods consume a great deal of computing resources and time because of their large number of parameters.

2. Video captioning: So far, video captioning adapts only to simple scenes consisting of a few active objects and provides coarse descriptions with little detailed information. In particular, these methods cannot determine which object is currently being described. In addition, image-based methods cannot effectively understand activities in terms of temporal evolution and spatial movement. Moreover, most methods rely on training with large amounts of data because of the difficulty of learning one-to-many mapping functions. In brief, with current methods and datasets, video understanding in complex scenes is an enormous challenge.

To overcome these limitations, we focus on visual semantic understanding under small samples and obtain the following innovative results:

1. Image-Text Dual Neural Network for Small-Sample Image Classification (IT-Dual Net). We propose IT-Dual Net, which combines the advantages of deep learning and multi-modal learning. By learning the relationships between the global semantic information and the semantic information of local visual contents, IT-Dual Net can better understand the holistic semantic information. It overcomes the insufficient training of deep models on small datasets and obtains significant improvements over the state of the art in small-sample image classification. On the LabelMe dataset and the UIUC-Sports dataset, the classification accuracy is improved from 92.9% to 97.75% and from 99.0% to 99.51%, respectively.
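As an illustration only (the abstract gives no implementation details), the following PyTorch-style sketch shows one generic way an image-text dual network for small-sample classification could be organized: an image branch and a text branch are projected into a shared space and fused for classification. The module names, feature dimensions, and concatenation-based fusion are assumptions made for the sketch, not the actual IT-Dual Net architecture.

```python
# Illustrative sketch only: a generic image-text dual network.
# Branch sizes, fusion strategy, and all names are assumed; the
# dissertation's actual IT-Dual Net may differ substantially.
import torch
import torch.nn as nn

class ImageTextDualNet(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, hidden=512, num_classes=8):
        super().__init__()
        # Image branch: maps pre-extracted CNN features into a shared space.
        self.img_branch = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        # Text branch: maps object-level text/topic features into the same space.
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        # Classifier over the fused (concatenated) joint representation.
        self.classifier = nn.Linear(hidden * 2, num_classes)

    def forward(self, img_feat, txt_feat):
        z_img = self.img_branch(img_feat)
        z_txt = self.txt_branch(txt_feat)
        return self.classifier(torch.cat([z_img, z_txt], dim=-1))

# Usage with random stand-in features for a batch of 4 samples.
model = ImageTextDualNet()
logits = model(torch.randn(4, 2048), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 8])
```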
2. Object-Oriented Video Captioning via Structured Trajectory (STraNet). First, we propose a novel task, named object-oriented video captioning, which moves from coarse holistic understanding to fine-grained object-level understanding. It breaks the bottleneck of traditional video understanding and can adapt to complex scenes where multiple concurrent activities happen. In addition, instead of a coarse description of an uncertain object, object-oriented video captioning aims to distinguish different objects and describe a specific object in more detail. Second, we re-annotate a new database that provides object-sentence pairs, transforming the difficult one-to-many learning into simple one-to-one learning. Most importantly, we design the video-based STraNet for small samples to replace previous image-based methods, so that activities are understood from the temporal transition of visual features and from spatial movement, without depending on supplementary cues from other tasks. Overall, the proposed STraNet does not rely on training with large amounts of data, which allows it to achieve a deeper and more efficient understanding.

3. Attribute-Enhanced Object-Oriented Video Captioning (AENet). Complex scenes call for detailed descriptions to distinguish different objects, e.g., the color and type of clothing, gender, and accessories. To obtain descriptions with rich and precise details and to overcome the problems caused by unbalanced data, we propose AENet, built from an attribute explorer and an attribute-enhanced caption generator. The proposed modules help the model capture more distinguishing features among different objects and generate more precise detailed information.

4. Object-Oriented Video Captioning via Adversarial Learning (OVC-GAN). Accurately recognizing the visual contents and the corresponding visual words is an enormous challenge in complex scenes. We introduce adversarial learning into video captioning and propose OVC-GAN. We design a discriminator to determine whether the generated visual words are suitable for the current visual contents. Furthermore, by observing paired and unpaired samples simultaneously, the discriminator facilitates learning the relationships between visual contents and visual words, and improves the overall performance.
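The abstract describes the discriminator only at a high level, so the following PyTorch-style sketch merely illustrates the general idea: a matching discriminator scores whether a word embedding fits a video-content embedding, trained on paired samples as positives and mismatched (unpaired) samples as negatives. All dimensions, names, the shuffling-based negative sampling, and the BCE loss are assumptions made for the sketch, not the actual OVC-GAN design.

```python
# Illustrative sketch only: a generic content-word matching discriminator
# trained on paired (matched) and unpaired (mismatched) samples.
import torch
import torch.nn as nn

class VisualWordDiscriminator(nn.Module):
    def __init__(self, vis_dim=512, word_dim=512, hidden=256):
        super().__init__()
        # Scores how well a word embedding fits a video-content embedding.
        self.net = nn.Sequential(
            nn.Linear(vis_dim + word_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, vis_feat, word_emb):
        return self.net(torch.cat([vis_feat, word_emb], dim=-1))

disc = VisualWordDiscriminator()
bce = nn.BCEWithLogitsLoss()
vis = torch.randn(4, 512)            # stand-in video-content features
words = torch.randn(4, 512)          # stand-in matching word embeddings
unpaired = words[torch.randperm(4)]  # toy negative sampling by shuffling
loss = bce(disc(vis, words), torch.ones(4, 1)) + \
       bce(disc(vis, unpaired), torch.zeros(4, 1))
loss.backward()  # gradients flow into the discriminator parameters
```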
Keywords/Search Tags:Visual Semantic Understanding, Image Classification, Video Understanding, Video Captioning, Small-Sample Learning, Deep Learning, Multi-Modal Learning, Adversarial Learning