The development of digital multimedia technology and the popularization of smart mobile devices have caused explosive growth in the scale of video data. Faced with such large-scale data, the manual approach to video understanding is time-consuming and labor-intensive, and cannot meet the requirements of realistic scenarios. It is therefore urgent to use machines to perform efficient video understanding and quickly provide decision candidates for humans. Traditional video understanding includes tasks such as human behavior recognition, video classification, action detection, and video retrieval. However, video contains not only image sequences but also other modalities such as subtitles and voice. Multi-modal video understanding has therefore attracted ever-increasing attention and has shown important research value in intelligent surveillance, human-computer interaction, cross-modal retrieval, and intelligent content editing.

This thesis focuses on the task of referring video segmentation: given video segments and corresponding textual descriptions as input, produce segmentation results for the video objects related to the descriptions. Although many researchers have recently made progress on referring video segmentation, numerous problems remain to be addressed, such as the complexity of textual descriptions, the lack of labeled data, the impact of video quality, and the imbalance between positive and negative samples. The thesis studies the following four aspects of referring video segmentation.

We propose a novel asymmetric cross-guided attention network for referring video segmentation. Due to the diversity of human language, the same object in the same scene may receive very different textual descriptions. We therefore study vision-guided language attention to extract robust linguistic features and promote the association between visual pixels and textual descriptions. At the same time, the context
information of visual pixels plays a crucial role in semantic segmentation performance. We therefore study language-guided vision attention, where visual global context related to the textual description is incorporated to enhance the context modeling of visual pixels. In addition, multi-resolution fusion and a weighted binary cross-entropy loss are adopted to exploit segmentation results of different granularities and to pay more attention to foreground pixels, respectively, thereby improving segmentation performance.

We propose a context-modulated dynamic network for referring video segmentation. Traditional dynamic convolution typically generates only a single-channel convolutional kernel to avoid introducing too many parameters. To address this defect, we adopt group convolution and point-wise convolution to generate multi-channel convolutional kernels with only a small increase in parameters, which provides a foundation for continuous interaction between vision and language. To solve the problem that the kernels generated by traditional dynamic convolution are independent of spatial and content information, we use the dynamic convolutional kernel generated from visual features containing contextual information to modulate the dynamic convolutional kernel generated from language, simultaneously achieving visual context modeling and multi-modal alignment. Besides, a deformable convolutional kernel and a convolutional long short-term memory network are employed to enhance transformation-modeling capability and to extract complementary motion information, respectively, thereby improving the performance of referring video segmentation.

We propose an object-agnostic Transformer-based network that does not rely on pre-trained object detectors. Since most existing methods either lack contextual-information modeling within each modality or design complex architectures for joint intra-modal and
inter-modal modeling, we propose a more universal and easily extensible model based on multi-modal Transformers for referring video segmentation. Moreover, to balance segmentation performance and computational efficiency, a cascaded segmentation network is proposed that decomposes the task into coarse-grained segmentation and fine-grained refinement. Besides, to analyze performance from a more balanced perspective, we propose a novel metric that takes the difficulty of samples into account. In the future, the model could also be combined with the self-supervised learning paradigm, using large-scale unlabeled data from the Internet for training and providing technical support for large-scale pre-training.

We introduce a new and challenging task, weakly supervised referring video segmentation, and propose a progressive reliable annotation method. With the development of Internet technology and the popularization of mobile smart devices, the growing scale of video data makes manual labeling increasingly time-consuming and laborious. By observing the distribution of samples, we argue that the presence of easy samples in the dataset allows reliable annotation information to be obtained with the help of off-the-shelf models pre-trained on referring image segmentation. The dataset is therefore divided into easy and hard samples, and knowledge distillation is used to transfer knowledge between models and guide the learning of the student network. At the same time, knowledge distillation within the student network makes full use of the reliable pseudo-labels generated on the hard samples and provides further supervision for training the student network, thereby improving segmentation performance.
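The cross-guided attention in the first contribution follows the standard query/key/value pattern of attention mechanisms. A minimal pure-Python sketch of one direction (vision-guided language attention, where word features attend over pixel features) is shown below; the dimensions and data are illustrative, not the thesis's actual configuration:

```python
import math

def cross_attention(queries, keys, values):
    """Minimal single-head cross-attention: each query (e.g. a word
    feature) attends over the keys (e.g. pixel features) and returns
    a softmax-weighted sum of the corresponding values."""
    d = len(keys[0])
    out = []
    for q in queries:
        # scaled dot-product scores between this query and every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# With a zero query, attention is uniform over the two keys,
# so the output is the average of the two value vectors.
avg = cross_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                      [[2.0, 0.0], [0.0, 2.0]])  # → [[1.0, 1.0]]
```

The same routine, with the roles of the two modalities swapped, would sketch the language-guided vision direction.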
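The weighted binary cross-entropy loss used in the first contribution can be sketched in a few lines. This is a minimal pure-Python version; the foreground weight `w_pos` is a hypothetical hyperparameter, not the thesis's exact formulation:

```python
import math

def weighted_bce(preds, targets, w_pos=2.0, w_neg=1.0, eps=1e-7):
    """Weighted binary cross-entropy over per-pixel probabilities.

    Foreground pixels (target == 1) are up-weighted by w_pos so the
    loss pays more attention to the typically rare object pixels.
    """
    total = 0.0
    for p, t in zip(preds, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
        total += -(w_pos * t * math.log(p)
                   + w_neg * (1.0 - t) * math.log(1.0 - p))
    return total / len(preds)

# With w_pos = 2, missing a foreground pixel by some margin costs
# twice as much as missing a background pixel by the same margin.
fg_miss = weighted_bce([0.2], [1])  # confident foreground miss
bg_miss = weighted_bce([0.8], [0])  # equally confident background miss
```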
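To see why group and point-wise convolutions make multi-channel dynamic kernels affordable in the second contribution, the parameter arithmetic can be sketched as follows. The channel count and kernel size are illustrative, and the extreme depthwise case (`groups` equal to the channel count) is used for the group convolution; the thesis's actual configuration may differ:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a 2-D convolution layer (bias ignored)."""
    return (c_in // groups) * c_out * k * k

c = 256  # hypothetical number of feature channels
k = 3    # spatial kernel size

# A dense multi-channel kernel mixes all input channels at once.
full = conv_params(c, c, k)

# Depthwise (grouped) convolution handles spatial mixing per channel,
# and a 1x1 point-wise convolution handles cross-channel mixing.
cheap = conv_params(c, c, k, groups=c) + conv_params(c, c, 1)

print(full, cheap)  # 589824 vs 67840 parameters
```

The factorized form needs under 12% of the dense kernel's parameters here, which is what makes generating multi-channel kernels dynamically practical.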
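The progressive annotation scheme in the last contribution — partitioning the data into easy and hard samples using a model pre-trained on referring image segmentation — can be sketched as below. The confidence scores, the 0.9 threshold, and the sample names are all hypothetical placeholders:

```python
def split_by_confidence(samples, teacher_confidence, threshold=0.9):
    """Partition samples into 'easy' ones, where the pre-trained
    teacher is confident enough for its prediction to serve as a
    reliable pseudo-label, and 'hard' ones, which are left for the
    student network to learn with further distillation."""
    easy, hard = [], []
    for s in samples:
        (easy if teacher_confidence[s] >= threshold else hard).append(s)
    return easy, hard

# Hypothetical per-clip teacher confidence scores.
scores = {"clip_a": 0.97, "clip_b": 0.42, "clip_c": 0.91, "clip_d": 0.55}
easy, hard = split_by_confidence(scores.keys(), scores)
print(easy, hard)  # ['clip_a', 'clip_c'] ['clip_b', 'clip_d']
```

In the thesis's pipeline, pseudo-labels from the easy split would supervise the student via knowledge distillation, while distillation within the student itself exploits reliable pseudo-labels that emerge on the hard split.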