
Research On Video Moment Localization Based On Natural Language Query

Posted on: 2024-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Tan
Full Text: PDF
GTID: 2568307076974749
Subject: Master of Electronic Information (Professional Degree)
Abstract/Summary:
In the information age, perception systems built on video surveillance, together with video platforms such as TikTok and Kwai, generate large amounts of video data, which has become an important part of daily life. Within this massive video data, users usually hope to quickly locate a target video moment through a natural language description; this is the task of video moment localization based on natural language queries. This thesis studies video moment localization methods for natural language queries, with the following main contributions:

(1) To address the low efficiency of video moment localization, this thesis proposes an efficient hash-based localization method. Existing methods overlook the importance of localization efficiency in practical application scenarios: the video and the query sentence must both be fed into the model network at retrieval time, which makes moment localization slow. To address this, the thesis uses hash learning to improve localization efficiency. The method converts query sentences and videos into text hash codes and sets of video hash codes, and stores these converted codes in advance. The position prediction network designed in this method compares the similarity between hash codes to determine the corresponding timestamps, so video features do not need to pass through the network again during retrieval and localization. In addition, existing methods require complex interaction and fusion between videos and query sentences, whereas the proposed efficient video moment localization via hashing (VMLH) method needs only simple XOR operations to locate video moments efficiently. This lays a foundation for rapid video moment localization and provides a theoretical basis for practical applications. Experimental results on two common datasets demonstrate the effectiveness of the method.

(2) This thesis proposes an efficient video moment localization method based on cross-modal attention hashing, addressing the low localization and retrieval accuracy caused by poor representation of semantic information between the video and text modalities. Because using hash codes as feature representations trades information loss for high localization efficiency, compensating for this loss and improving the cross-modal semantic representation becomes an urgent problem. To address it, the thesis uses cross-modal attention hash learning to capture semantic associations between the video and natural language modalities. The proposed method first converts query sentences and raw video features into compact binary hash codes through a hash learning model. Simultaneously, a soft attention module weights the key words in the query sentence; the video hash codes and query hash codes are then fed into an enhanced cross-modal attention model to explore the semantic relationship between vision and language. Finally, a score prediction and position prediction network is designed to locate the starting timestamps of the queried moments. Experiments on two publicly available datasets show that the proposed method significantly improves both retrieval efficiency and accuracy.
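The XOR-based matching idea behind VMLH can be illustrated with a minimal sketch: similarity between binary hash codes reduces to Hamming distance, computed with a single XOR plus a popcount. The function names and toy 8-bit codes below are illustrative assumptions, not the thesis's actual implementation.

```python
# Toy sketch of XOR-based hash matching (assumed names, not the thesis code).

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hash codes stored as ints."""
    return bin(a ^ b).count("1")

def locate_moment(query_code: int, clip_codes: list) -> int:
    """Index of the video clip whose hash code is closest to the query's
    (smallest Hamming distance), i.e. the predicted moment."""
    return min(range(len(clip_codes)),
               key=lambda i: hamming_distance(query_code, clip_codes[i]))

# Toy example: 8-bit hash codes for five consecutive video clips.
clips = [0b10110010, 0b10110110, 0b01001101, 0b01001001, 0b11110000]
query = 0b01001111
print(locate_moment(query, clips))  # → 2 (clip 2 differs in only one bit)
```

Because the clip codes are precomputed and stored, retrieval needs no forward pass over video features, which is the source of the claimed efficiency gain.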
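The soft-attention word weighting in the second method can likewise be sketched: each word feature gets a softmax score, and the query representation is the weighted sum. The toy dimensions and the random stand-in for the learned projection are assumptions for illustration only.

```python
# Minimal soft-attention sketch (toy data; the projection vector stands in
# for learned parameters and is not from the thesis).
import numpy as np

def soft_attention(word_feats: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Weight each word feature by a softmax attention score and
    return the attended query representation."""
    scores = word_feats @ w                  # one scalar score per word
    scores = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = scores / scores.sum()            # attention weights, sum to 1
    return alpha @ word_feats                # weighted sum of word features

rng = np.random.default_rng(0)
words = rng.normal(size=(6, 16))   # 6 words, 16-dim features each
w = rng.normal(size=16)            # stand-in for a learned projection
q = soft_attention(words, w)
print(q.shape)  # → (16,)
```

The attended vector then replaces a plain average of word features before hashing, so informative query words dominate the resulting hash code.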
Keywords/Search Tags: Cross-modal retrieval, video moment localization, attention model, visual comprehension, hashing