With the rapid development of social media and computer network technology, a large amount of multimedia data is transmitted over networks. Video is one of the most representative and complex types of multimedia data, and automatically extracting useful information from the huge volume of video data has become increasingly important. The video description task has therefore become one of the hot research directions of recent years, owing to its great potential value in human-computer interaction, video surveillance, and video retrieval. Unlike image description, video description requires understanding the context of a video, which makes describing open-domain video difficult: not only does a video contain dynamic objects, scenes, actions, and other information, but it is also hard to determine the order in which this complex information should be presented and how to express it in accurate, concise language. The mining of important information and the optimisation of linguistic descriptions are thus key problems that must be solved in the video description task. The main research work and contributions of this paper are as follows:

1. In current video description methods, spatially redundant information in video features is usually not effectively eliminated. Moreover, the commonly used loss function is built from the log-probabilities of the correct target words, so long sentences incur a large loss, and after optimisation with this log-likelihood loss the model tends to generate sentences that are too short, with incomplete semantics and low accuracy. This paper proposes a video description method based on semantic information filtering and sentence length modulation to address these problems. First, in the encoding stage, the model introduces a gating fusion mechanism that screens the semantic features of the video, removing redundant or unimportant information, reducing its interference with the generated description, and improving description accuracy. Second, in the decoding stage, a new sentence-length-modulated loss function is proposed: the cross-entropy loss is modulated by the length of the label sentence, which alleviates the model's tendency to generate short sentences and brings the semantics of the generated description closer to the label, thereby improving accuracy. Ablation and comparison experiments on the widely used MSVD dataset demonstrate that the proposed method significantly improves the accuracy of the generated video descriptions, with all metrics outperforming existing models by a clear margin. (A minimal sketch of both mechanisms is given after contribution 2 below.)

2. In general video description methods, the internal relationships among multiple video features are often not fully exploited, and sentence-level semantic consistency is ignored when generating descriptions, resulting in inaccurate sentences. This paper introduces a video description method that leverages a feature enhancement fusion strategy and semantic consistency to address these issues. First, in the encoding stage, the model introduces a feature enhancement fusion strategy in which the related information between different features is used to reinforce each feature, and each feature is assigned a coefficient according to its importance; in the decoding stage, features are then selected according to these coefficients. Second, a sentence-level semantic loss is integrated into the word-level cross-entropy loss to encourage the semantic vector of the predicted sentence to be consistent with that of the ground-truth label, enhancing the model's ability to generate semantically accurate descriptions. Experimental results demonstrate the effectiveness of this approach, with significant improvements in BLEU@4, METEOR, ROUGE, and CIDEr over existing models. (A sketch of these two components follows the first one below.)
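To make contribution 1 concrete, the following is a minimal PyTorch-style sketch of its two mechanisms: a gating fusion module that filters redundant semantic information in the encoding stage, and a cross-entropy loss modulated by the label sentence length in the decoding stage. The thesis does not specify the implementation, so the module names, the form of the gate, and the log-length weighting with its coefficient `alpha` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSemanticFilter(nn.Module):
    """Gating fusion mechanism (contribution 1, encoding stage): a sigmoid
    gate computed from the visual and semantic features suppresses redundant
    components of the semantic feature before the two are fused.
    (Hypothetical design; the thesis does not give the exact gate.)"""

    def __init__(self, visual_dim, semantic_dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(visual_dim + semantic_dim, semantic_dim)
        self.fuse = nn.Linear(visual_dim + semantic_dim, hidden_dim)

    def forward(self, visual, semantic):
        # g in (0, 1): per-dimension importance of the semantic feature
        g = torch.sigmoid(self.gate(torch.cat([visual, semantic], dim=-1)))
        filtered = g * semantic  # screen out redundant semantic information
        return torch.tanh(self.fuse(torch.cat([visual, filtered], dim=-1)))

def length_modulated_ce(logits, targets, pad_id=0, alpha=0.5):
    """Sentence-length-modulated cross-entropy (contribution 1, decoding
    stage): token-level cross-entropy is averaged per sentence, then
    re-weighted by a factor that grows with the label sentence length,
    so optimisation is no longer biased toward short sentences.
    logits: (batch, seq_len, vocab); targets: (batch, seq_len)."""
    mask = (targets != pad_id).float()                      # (B, T)
    ce = F.cross_entropy(logits.transpose(1, 2), targets,
                         reduction="none") * mask           # (B, T)
    lengths = mask.sum(dim=1).clamp(min=1.0)                # label sentence lengths
    per_sentence = ce.sum(dim=1) / lengths                  # mean CE per sentence
    weight = 1.0 + alpha * torch.log(lengths)               # longer labels weigh more
    return (weight * per_sentence).mean()
```

In training, `length_modulated_ce` would simply replace the standard cross-entropy term; the exact modulation function used in the thesis may differ.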
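Similarly, the sketch below illustrates the two components of contribution 2: a feature enhancement fusion module in which feature streams (e.g. 2D appearance and 3D motion features) reinforce one another and are weighted by learned importance coefficients, and a sentence-level semantic consistency term folded into the word-level cross-entropy. The use of multi-head cross-attention, mean pooling, cosine similarity between sentence vectors, and the weight `beta` are assumptions rather than details taken from the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhancementFusion(nn.Module):
    """Feature enhancement fusion (contribution 2, encoding stage): each
    feature stream attends to the others so that related information is
    mutually reinforced, and a learned importance coefficient weights each
    enhanced stream; features are then selected by these coefficients."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, streams):
        # streams: list (length >= 2) of (batch, n_i, dim) feature sequences
        enhanced = []
        for i, x in enumerate(streams):
            others = torch.cat([s for j, s in enumerate(streams) if j != i], dim=1)
            att, _ = self.cross(x, others, others)  # reinforce x with related info
            enhanced.append(x + att)
        pooled = torch.stack([e.mean(dim=1) for e in enhanced], dim=1)  # (B, S, dim)
        coeff = torch.softmax(self.score(pooled), dim=1)                # (B, S, 1)
        return (coeff * pooled).sum(dim=1)                              # (B, dim)

def semantic_consistency_loss(pred_sent_vec, label_sent_vec, ce_loss, beta=0.2):
    """Sentence-level semantic loss integrated into the word-level
    cross-entropy (contribution 2, decoding stage): pushes the semantic
    vector of the predicted sentence toward that of the ground-truth label."""
    sem = 1.0 - F.cosine_similarity(pred_sent_vec, label_sent_vec, dim=-1).mean()
    return ce_loss + beta * sem
```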
3. A video retrieval and review application is implemented on top of the video description model. Through subtitle extraction, video description, and other processing of various news videos, the system obtains video subtitles, keywords, scene descriptions, video categories, and other tag information; videos can then be retrieved via the scene descriptions, keywords, and related tags, and the video keywords are used to judge whether the video content is compliant. Experiments show that the system plays an important role in accurately retrieving videos and can perform video review efficiently. (A minimal sketch of this pipeline is given after the summary below.)

In summary, this paper works to improve the accuracy of video description generation in terms of both features and descriptions. Experimental evaluation shows that the proposed models outperform existing state-of-the-art models. In addition, the models provide technical support for applying video description technology to video retrieval and review: they not only achieve accurate retrieval of relevant videos and improve the utilisation of media resources, but also enable the analysis of multi-modal anomalies, such as frames and subtitles, in videos to be played, which improves the efficiency of video review, helps select better content to recommend, and blocks the dissemination of non-compliant content.
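Finally, as a concrete illustration of the application described in contribution 3, the sketch below shows one plausible shape for the per-video tag record and the keyword-based retrieval and compliance checks built on it. The `VideoTags` structure, the substring matching, and the blocklist are hypothetical; the thesis does not describe the application's data model.

```python
from dataclasses import dataclass, field

@dataclass
class VideoTags:
    """Hypothetical tag record produced for each news video by the pipeline
    (subtitle extraction, video description, keyword/category tagging)."""
    video_id: str
    subtitles: str           # e.g. from OCR/ASR subtitle extraction
    scene_description: str   # generated by the video description model
    keywords: list = field(default_factory=list)
    category: str = ""

def retrieve(videos, query):
    """Return videos whose scene description or keywords match the query."""
    q = query.lower()
    return [v for v in videos
            if q in v.scene_description.lower()
            or any(q in k.lower() for k in v.keywords)]

def is_compliant(video, blocklist):
    """Judge compliance from the video keywords and subtitles: the video
    is flagged as non-compliant if any blocked term appears."""
    text = " ".join(video.keywords + [video.subtitles]).lower()
    return not any(term.lower() in text for term in blocklist)
```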