With the rapid development of information technology and the widespread adoption of smart devices, the number of videos produced in contemporary society grows by the day, and the rise of short video in particular has driven explosive growth in the amount of video content on the internet. Extracting information directly from large volumes of video is inefficient; if video content can be converted into text, people can quickly grasp a video by skimming its description, which improves the efficiency of information acquisition. This is the task of video captioning. Video is a complex data type, and its accurate understanding must take full account of the contextual information it contains. To address this problem, the main work of this thesis is as follows:

(1) A video captioning method combining local context and semantic awareness is proposed to solve the missing-detail problem of existing methods while reducing the discrepancy between video sequences and text sequences. To recover missing details, an object detection model is first used to extract object-level features from video frames, enriching the local information of the video. A temporal attention mechanism is then designed to better combine the global and local contextual information of the video. In addition, considering the differences between the video and text modalities, high-level semantic attributes of the video are incorporated into a conventional LSTM decoder so that the video's contextual information is better guided toward generating accurate descriptions. (A sketch of this attention and decoding step is given after this abstract.)

(2) A multimodal video captioning method based on memory context is proposed, which exploits audio information and the correlations between videos, both easily overlooked by existing models, to improve the accuracy and diversity of the generated captions. Most existing methods use purely visual features and ignore the role of audio in understanding video; this thesis therefore additionally uses the audio features of a video to provide complementary information. Moreover, existing methods attend only to the video currently being processed and do not consider correlations across multiple videos. A memory module is added to store these cross-video correlations, so that generated captions take into account both the information of the current video and the stored memory, capturing a wider range of information. (A sketch of the memory read step also follows the abstract.)

(3) With the proposed captioning methods at its core, an automatic short-video caption generation system is designed and developed; the development process is described in detail and the application is demonstrated.

In summary, to address the insufficient mining of video contextual information in current methods, this thesis proposes a video captioning method combining local context and semantic awareness and a multimodal video captioning method based on memory context. Experimental results show that the proposed methods effectively improve the accuracy and diversity of video captions. Finally, a prototype system for automatic short-video caption generation is implemented based on the proposed methods.
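The abstract only outlines the first method, so the following is a minimal PyTorch sketch of one plausible reading: additive temporal attention over stacked global-frame and detected-object features, with high-level semantic attributes gating the attended context before each LSTM decoding step. All class names, dimensions, and the gating scheme are illustrative assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn


class SemanticAwareDecoder(nn.Module):
    """Minimal sketch: temporal attention over global frame features and
    local object features, with semantic attributes gating the attended
    context fed to an LSTM decoder. Dimensions and wiring are assumptions,
    not the thesis's exact architecture."""

    def __init__(self, feat_dim=512, sem_dim=300, hid_dim=512, vocab=10000):
        super().__init__()
        self.att_w = nn.Linear(feat_dim, hid_dim)      # project features
        self.att_h = nn.Linear(hid_dim, hid_dim)       # project decoder state
        self.att_v = nn.Linear(hid_dim, 1)             # scalar attention score
        self.sem_gate = nn.Linear(sem_dim, feat_dim)   # semantic gating vector
        self.embed = nn.Embedding(vocab, hid_dim)
        self.lstm = nn.LSTMCell(feat_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)

    def attend(self, feats, h):
        # feats: (B, T, feat_dim) global frame + object features along T
        # h:     (B, hid_dim)     previous decoder hidden state
        scores = self.att_v(torch.tanh(self.att_w(feats)
                                       + self.att_h(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)           # (B, T, 1)
        return (alpha * feats).sum(dim=1)              # (B, feat_dim)

    def step(self, word_id, feats, sem, h, c):
        # Semantic attributes gate the attended context vector.
        ctx = self.attend(feats, h) * torch.sigmoid(self.sem_gate(sem))
        x = torch.cat([ctx, self.embed(word_id)], dim=-1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), h, c                       # next-word logits


# Toy usage with random tensors (shapes only, untrained weights).
dec = SemanticAwareDecoder()
feats = torch.randn(1, 40, 512)              # 40 frame/object feature vectors
sem = torch.randn(1, 300)                    # semantic attribute embedding
h = c = torch.zeros(1, 512)
bos = torch.zeros(1, dtype=torch.long)       # assumed <bos> token id 0
logits, h, c = dec.step(bos, feats, sem, h, c)
```

Additive (Bahdanau-style) attention is used here only because it is the most common choice for this kind of temporal weighting; sigmoid gating is likewise just one of several plausible ways to inject semantic attributes into the decoder.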
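The memory-context mechanism of the second method is also described only at a high level. The sketch below shows one plausible interpretation: visual and audio features are fused into a multimodal query, a learnable memory shared across videos is read by softmax-weighted similarity, and the read-out is concatenated to the current video's representation before decoding. The memory layout, the dot-product read, and all sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryContext(nn.Module):
    """Minimal sketch: fuse visual and audio features, then read a
    cross-video memory by similarity so the decoder sees both the current
    video and correlated information from other videos. Sizes and the
    dot-product read are assumptions, not the thesis's exact design."""

    def __init__(self, vis_dim=512, aud_dim=128, mem_slots=256, mem_dim=512):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + aud_dim, mem_dim)  # multimodal fusion
        # Learnable memory holding representations shared across videos.
        self.memory = nn.Parameter(torch.randn(mem_slots, mem_dim) * 0.01)

    def forward(self, vis, aud):
        # vis: (B, vis_dim) pooled visual features of the current video
        # aud: (B, aud_dim) pooled audio features of the current video
        query = torch.tanh(self.fuse(torch.cat([vis, aud], dim=-1)))  # (B, M)
        weights = F.softmax(query @ self.memory.t(), dim=-1)  # (B, slots)
        read = weights @ self.memory                           # (B, M)
        # Decoder input: current multimodal context plus memory read-out.
        return torch.cat([query, read], dim=-1)                # (B, 2M)


# Toy usage: two videos in a batch yield a (2, 1024) decoder context.
mc = MemoryContext()
ctx = mc(torch.randn(2, 512), torch.randn(2, 128))
```

A learnable parameter matrix read by dot-product attention is the simplest realization of such a memory; the thesis's actual module may instead store and update entries from previously processed videos.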