| With advances in information technology and the rapid growth of the Internet, the number of videos that people can retrieve now runs into the millions. With the spread of mobile smart devices, video has become an important information medium in daily life and work. Asking questions in natural language has long been the most familiar way for people to obtain information, so natural language-based video retrieval will play an important role in work and daily life. Video-natural language retrieval refers to locating, within a video, the clip that matches a given natural language description. According to the length of the video being searched, the problem is divided into short video retrieval and long video retrieval. In short video retrieval, because the video is short, the target clip can usually be retrieved accurately by segmenting the video and aggregating the segments. Short video retrieval is the basis of long video retrieval. In long video retrieval, because the video is long and carries a large amount of semantic content, accurate localization usually relies on segmentation, matching, and refinement. Based on an analysis of existing deep learning research on video-natural language retrieval, this paper studies key issues such as language features, similarity measurement, and effective sample selection for both short and long video retrieval. The following research was carried out. For short video-natural language retrieval, this paper proposes a Moment Retrieval Network (MRN) based on model expression optimization. The traditional video-natural language retrieval model uses global long short-term memory (LSTM) networks to extract linguistic features and traditional metrics to compute similarity. This approach does not highlight the key information in the sentence, the model is complex, and the metric captures similarity only incompletely. To address these problems, we propose grouping words by part of speech to highlight key information and reduce model complexity, together with a deep inner product metric with strong expressive power. Experimental results on the short video-natural language retrieval dataset DEDIMO show that the proposed MRN retrieves more accurate video clips than existing methods. For long video-natural language retrieval, this paper proposes a Moment Localization Network (MLN) based on effective sample selection. Long video-natural language retrieval suffers from a small amount of data and, after data augmentation, from samples of uneven validity. To address the semantic confusion of negative samples in triplet training, we propose a semantic sliding window method for selecting effective negative samples. However, the number of samples remains large, and training on all triplets is time-consuming. To solve this problem, and given that triplets differ in how effective they are for training, we propose an uncertainty-based method for selecting effective triplets, which can train a better model with fewer samples. Experimental results on the long video-natural language retrieval dataset TACoS show that the proposed MLN localizes moments more accurately than existing methods. |
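
The sample selection idea behind the MLN can be illustrated with a minimal sketch. It is not the implementation used in this paper: the abstract does not specify the semantic sliding window criterion or the uncertainty measure, so the sketch assumes that a negative window is "effective" when its temporal overlap (IoU) with the ground-truth moment is at most a threshold, and that a triplet is "uncertain" when its positive and negative similarity scores are close. All function names, parameters, and default values are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) of a clip, in seconds


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two time spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def sliding_window_negatives(duration: float, positive: Span, window: float,
                             stride: float, max_iou: float = 0.0) -> List[Span]:
    """Slide a fixed-size window over the video and keep the windows whose
    overlap with the positive moment is at most max_iou.
    ASSUMPTION: an IoU filter stands in for the paper's semantic criterion."""
    negatives = []
    start = 0.0
    while start + window <= duration:
        candidate = (start, start + window)
        if temporal_iou(candidate, positive) <= max_iou:
            negatives.append(candidate)
        start += stride
    return negatives


def triplet_loss(pos_score: float, neg_score: float, margin: float = 0.1) -> float:
    """Margin ranking loss on similarity scores: the positive clip should
    score higher than the negative clip by at least `margin`."""
    return max(0.0, margin - pos_score + neg_score)


def select_uncertain(triplets: List[Tuple[float, float, str]], k: int):
    """Keep the k triplets (pos_score, neg_score, id) whose scores are closest.
    ASSUMPTION: a small score gap is used as a proxy for model uncertainty."""
    return sorted(triplets, key=lambda t: abs(t[0] - t[1]))[:k]
```

Under these assumptions, windows that partially cover the described moment are excluded from the negative set, since they are semantically ambiguous, and only the triplets the current model finds hardest to rank are kept for training.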