
Research On Action Recognition Method Based On CLIP Pre-trained Model

Posted on: 2024-09-23
Degree: Master
Type: Thesis
Country: China
Candidate: H B Yuan
Full Text: PDF
GTID: 2568307109981299
Subject: Computer application technology
Abstract/Summary:
As one of the important sub-problems of video understanding, video action recognition has long been a focus of research in computer vision. Because large-scale video datasets with high-quality annotations are difficult to collect, it is impractical to train a large pre-trained model for video tasks from scratch. The current mainstream approach is to fine-tune an image pre-trained model end-to-end on video data, but this approach is computationally expensive and prone to catastrophic forgetting. To better transfer large pre-trained models to the action recognition task, this thesis explores prompt learning and adapter training as ways to transfer the large pre-trained vision-language model CLIP to video action recognition. The main work includes the following (illustrative sketches of the key components follow this summary):

(1) For the more difficult task of recognizing action sequences in long videos, this thesis proposes an action recognition method based on prompt learning and contrastive learning. The method uses five manually designed prompt templates whose textual prompts describe videos accurately and flexibly. It also introduces a prompt diversification loss that enriches the diversity of the prompt template embeddings by penalizing redundant information in the self-attention matrices of the text encoder, and the video and text encoders are trained jointly with a contrastive learning strategy. Experimental results show that the method learns not only the order of action sequences but also the semantics of high-level activities.

(2) To address the poor robustness and high trial-and-error cost of manually designed prompt templates, this thesis proposes an action recognition method based on automatic prompt learning. The method uses automatic prompt templates of three shapes (prefix, cloze, and suffix) in which the prompt contexts are learnable vectors, all parameters of the pre-trained CLIP model are frozen, and the prompt slots are filled with action category labels. The method also includes a two-level temporal encoder that bridges the gap between images and videos. Experimental results show that, under few-shot settings, the method matches or even surpasses the recognition accuracy of the manual prompt template approach.

(3) Designing and training adapters is another effective line of transfer learning, and this thesis also explores how adapters can help the CLIP model transfer to video action recognition. An action recognition method based on an encoder-decoder architecture is designed: the encoder is the image encoder of the frozen CLIP model, and the decoder is a Transformer decoder that integrates three temporal modeling modules and has lower computational complexity than the encoder. Experimental results show that the method requires no data from additional modalities and achieves competitive action recognition accuracy with lower computational cost and fewer parameters.

All three proposed methods effectively transfer the CLIP model to the video action recognition task. They help overcome the problem that video datasets are too small to fine-tune all parameters of a pre-trained model, and they improve video action recognition accuracy.
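To make the prompt diversification idea in method (1) concrete, here is a minimal PyTorch sketch. The thesis does not publish its exact formulation; the function name `prompt_diversification_loss` and the specific penalty (mean pairwise cosine similarity between the text encoder's per-template self-attention maps) are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

def prompt_diversification_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity between the text encoder's
    self-attention maps for different prompt templates; minimizing it
    penalizes redundant (near-identical) attention patterns.

    attn_maps: (P, L, L) -- one self-attention matrix per template,
    P templates, sequence length L.
    """
    P = attn_maps.size(0)
    flat = F.normalize(attn_maps.reshape(P, -1), dim=-1)   # unit-norm flattened maps
    sim = flat @ flat.t()                                  # (P, P) cosine similarities
    off_diag = ~torch.eye(P, dtype=torch.bool, device=sim.device)
    return sim[off_diag].mean()                            # penalize cross-template redundancy
```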
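The joint training of the video and text encoders in method (1) follows a CLIP-style contrastive strategy. The sketch below shows the standard symmetric cross-entropy form of that objective; the function name and the temperature value are illustrative, not taken from the thesis.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric CLIP-style contrastive loss.

    video_emb, text_emb: (B, D) embeddings of B matched video-text pairs;
    matched pairs sit on the diagonal of the similarity matrix.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                      # (B, B) similarity logits
    targets = torch.arange(len(v), device=logits.device)  # diagonal = positives
    loss_v2t = F.cross_entropy(logits, targets)           # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)       # text -> video direction
    return (loss_v2t + loss_t2v) / 2
```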
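Method (2)'s learnable prompt contexts can be sketched in the style of CoOp-like prompt tuning: the context vectors are the only trainable parameters, CLIP stays frozen, and the embedding of an action category label fills the prompt slot. The class name `AutoPrompt` and the way the cloze shape splits its context in half are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AutoPrompt(nn.Module):
    """Learnable prompt context around a frozen CLIP text encoder.

    shape='prefix' puts the context before the class-label embedding,
    'suffix' puts it after, and 'cloze' splits it around the label.
    Only self.ctx is trained; CLIP parameters stay frozen elsewhere.
    """

    def __init__(self, n_ctx: int, dim: int, shape: str = "prefix"):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable context vectors
        self.shape = shape

    def forward(self, cls_emb: torch.Tensor) -> torch.Tensor:
        # cls_emb: (n_tokens, dim) token embeddings of an action label, e.g. "high jump"
        if self.shape == "prefix":
            return torch.cat([self.ctx, cls_emb], dim=0)
        if self.shape == "suffix":
            return torch.cat([cls_emb, self.ctx], dim=0)
        half = self.ctx.shape[0] // 2                             # 'cloze': context on both sides
        return torch.cat([self.ctx[:half], cls_emb, self.ctx[half:]], dim=0)
```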
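Method (3)'s adapter design, a frozen CLIP image encoder feeding a lightweight trainable Transformer decoder, can be sketched as below. The query-based decoder, the layer count, and the mean pooling are illustrative stand-ins; the thesis's three temporal modeling modules are not reproduced here.

```python
import torch
import torch.nn as nn

class VideoDecoderAdapter(nn.Module):
    """Frozen CLIP image encoder per frame + small trainable decoder.

    `dim` must match the CLIP visual output dimension (512 for ViT-B/32);
    `clip_visual` is assumed to map preprocessed frames to (N, dim) features.
    """

    def __init__(self, clip_visual: nn.Module, dim: int = 512,
                 n_queries: int = 8, n_classes: int = 400):
        super().__init__()
        self.clip_visual = clip_visual.eval()
        for p in self.clip_visual.parameters():
            p.requires_grad_(False)                      # encoder stays frozen
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -- T sampled, preprocessed video frames
        B, T = frames.shape[:2]
        with torch.no_grad():
            feats = self.clip_visual(frames.flatten(0, 1))  # (B*T, dim) frame features
        feats = feats.view(B, T, -1)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out = self.decoder(q, feats)                     # queries attend over the frame sequence
        return self.head(out.mean(dim=1))                # (B, n_classes) action logits
```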
Keywords/Search Tags: Video Action Recognition, CLIP, Prompt Learning, Contrastive Learning, Parameter-efficient Fine-tuning