Video content understanding is one of the forward-looking directions of today's artificial intelligence. It is an important step toward intelligent systems that approach human-level understanding, and it has significant application value in social services, national security, and industrial development. However, video data is unstructured, highly redundant, and high-dimensional, with deeply hidden information that is difficult to interpret. Mapping complex video content into a semantic space that matches human cognitive habits is therefore a central challenge of video content understanding. With the rapid development of artificial intelligence technology represented by deep learning, video content understanding has attracted widespread attention from academia and industry. In this paper, video semantics refers to the various features present in a video, and semantic mining refers to extracting those features.

In recent years, thanks to the outstanding performance of convolutional neural networks in image feature extraction and the rapid progress of recurrent neural networks in language processing, video content understanding has achieved notable results, but problems remain. Selecting the key content to describe in massive video data is difficult; noise in video data prevents accurate alignment between visual semantics and predicted content; visual features are insufficiently extracted and utilized; and overly large video data volumes cause overfitting during model training, while failure to exploit sentence grammar leads to incomplete or inaccurate descriptions. Together these problems make the semantic descriptions produced by video content understanding inaccurate and incomplete. This paper therefore proposes the following methods to address them.

1. A video content understanding model based on a feature selection and semantic alignment network. By
improving the feature extraction network, important feature information is selected and extracted. First, image-frame information is mapped into feature vectors; important features are learned and selected by a selection-and-extraction network based on inter-frame relationships, and then passed through a fully connected layer, achieving effective extraction of the video's spatial features. Second, a new semantic alignment loss function is designed. Training on negative samples lets the decoder identify the hard samples among them, and the loss computed on negative samples is reweighted accordingly, improving the semantic relevance between text and images and thereby optimizing the model.

2. A video content understanding model based on a spatio-temporal feature extraction and pruning network. Extracting the relationships among video-frame features tightens the connections between features. First, a temporal attention mechanism extracts the key frames of the video; the salient-region information and background information of the key frames are then extracted, and a spatial fusion function binds these two kinds of features closely together, achieving accurate and comprehensive feature extraction. Second, a new model-pruning method is designed: variational dropout lets the model adaptively adjust each neuron's dropout rate toward an optimal value, which effectively mitigates overfitting during training.

3. A video content understanding model based on a visual and grammatical feature extraction network. The model is guided to describe video content in text by training a sentence-grammar generator together with embeddings of local target visual features. First, the video is encoded into a feature vector; then local visual features and the part-of-speech
features of sentence grammar are extracted and used to train a sentence-grammar generator, which then guides the model to describe the video content in words.

In summary, for simple short-description video scenes, this paper first proposes a video content understanding model based on a feature selection and semantic alignment network, which addresses the difficulty of selecting key content during video feature extraction and the failure to identify hard samples, improving description accuracy. Building on this, to further improve the accuracy and completeness of descriptions, the paper proposes a video description model based on a spatio-temporal feature extraction and pruning network, which addresses insufficient extraction and utilization of video features as well as model overfitting. Finally, for complex long-description video scenes, the paper proposes a video content understanding model based on a visual and grammatical feature extraction network, which addresses the incomplete and inaccurate sentences caused by generating words in a purely chain-structured manner. Experimental results show that the proposed methods significantly improve the accuracy of video description, outperform the baseline models, and are effective in practical application.
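The negative-sample reweighting idea behind the semantic alignment loss can be sketched as a margin-based loss in which negatives that are more similar to the video (harder negatives) receive larger weights. This is a minimal illustration, not the paper's implementation: the exponential hardness weight and the margin value are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(video_feat, pos_text, neg_texts, margin=0.2):
    """Margin-based alignment loss over one positive caption and several
    negative captions. Harder negatives (higher similarity to the video)
    get larger weights, so training focuses on the difficult samples.
    The exp(similarity) weighting is a hypothetical choice."""
    s_pos = cosine(video_feat, pos_text)
    weighted, total_w = 0.0, 0.0
    for neg in neg_texts:
        s_neg = cosine(video_feat, neg)
        w = math.exp(s_neg)  # hypothetical hardness weight
        weighted += w * max(0.0, margin - s_pos + s_neg)
        total_w += w
    return weighted / total_w
```

With a well-aligned positive and an easy negative the hinge term is zero; a hard negative close to the video feature contributes a positive, up-weighted loss.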
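The temporal-attention key-frame step can be illustrated as scoring each frame feature against a query vector, normalizing the scores with a softmax into attention weights, and keeping the top-k frames. In the model the query would be learned; here it is a fixed stand-in, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def select_key_frames(frames, query, k=2):
    """Score each frame feature against a query vector (a stand-in for a
    learned attention query), softmax the scores into attention weights,
    and keep the k highest-weighted frames as key frames."""
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in frames]
    weights = softmax(scores)
    ranked = sorted(range(len(frames)), key=lambda i: weights[i], reverse=True)
    key_idx = sorted(ranked[:k])  # keep temporal order of selected frames
    return key_idx, weights
```

The selected key frames would then feed the salient-region and background extraction stages described above.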
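The variational-dropout pruning step can be sketched from the standard relation alpha = p / (1 - p) between a neuron's multiplicative noise parameter alpha and its dropout rate p, which gives p = sigmoid(log alpha). Neurons whose inferred dropout rate exceeds a threshold carry little information and can be pruned. The threshold value here is an assumption for illustration.

```python
import math

def dropout_rates(log_alphas):
    """Recover each neuron's dropout rate p from its learned log(alpha),
    using alpha = p / (1 - p), i.e. p = sigmoid(log alpha)."""
    return [1.0 / (1.0 + math.exp(-la)) for la in log_alphas]

def prune_mask(log_alphas, threshold=0.95):
    """Neurons whose adaptive dropout rate exceeds the threshold are
    treated as uninformative and masked out (pruned)."""
    return [0 if p > threshold else 1 for p in dropout_rates(log_alphas)]
```

During training the log-alpha values are learned per neuron, so the effective dropout rate adapts instead of being fixed by hand, which is how the pruning network combats overfitting.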