With the rapid development of electronic devices and network communication technology, portable computers, smartphones, and other devices have gradually brought human society into the era of self-media. The large number of short videos emerging on public network platforms urgently needs to be analyzed and managed, so the study of intelligent video analysis technology has great practical value. Natural language description of video, an important branch of computer vision, places higher demands on research into intelligent video analysis and detection. Social video platforms such as Kuaishou and Douyin host many illegal videos, yet controlling their spread currently relies mainly on manual review. Beyond blocking illegal videos and maintaining network security, this technology can also help public network platforms retrieve and classify videos efficiently. Motivated by these practical problems, this thesis proposes a natural language generation model with strong robustness and high accuracy. The key problems and research methods are as follows:

1. To address the incomplete feature extraction for open-domain video, a multi-modal feature extraction method is proposed. Independent feature extraction models are used to extract the RGB, optical flow, audio, and C3D (Convolutional 3D) features of the video, so that multiple dimensions of video information are taken into account. Experimental results show that the proposed method is robust, maintains good object and action detection across different scenes, and provides a complete and deep video representation.

2. To address the low accuracy of natural language models, a bidirectional natural language model based on a multi-modal attention mechanism is proposed. The model is divided into two stages, encoding and decoding. In the encoding stage, the features of each modality are fed into a separate bidirectional RNN (Recurrent Neural Network), specifically one built from LSTM (long short-term memory) units, which outputs hidden state vectors. A bidirectional encoder encodes the feature vectors in both the forward and backward directions, which is more effective and comprehensive than a unidirectional model. In the decoding stage, a multi-modal attention mechanism is introduced to fuse the hidden states of the different modalities, so that the sequence of state vectors can be decoded into a sequence of words more accurately. Experimental results show that the proposed natural language model improves the accuracy of the generated video descriptions.

3. Because current research mostly generates English descriptions and little work addresses Chinese, this thesis proposes a method for generating Chinese sentences for video. It mainly solves the problems of constructing a Chinese dataset and of representing and processing Chinese characters.

The methods and results presented in this thesis can serve as an effective reference for subsequent researchers and are significant for computer vision and natural language processing.
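
To make the decoding-stage fusion in item 2 concrete, the following is a minimal sketch and not the implementation used in this thesis. It assumes dot-product scoring, a shared hidden size across modalities, and encoder hidden states that are already available as plain lists of numbers (in the model they would be the outputs of the per-modality bidirectional LSTMs). The function names `attend` and `multimodal_attention` and the modality labels are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, states):
    """Dot-product attention (an assumed scoring function).

    Returns the weighted sum of `states` and the attention weights.
    """
    weights = softmax([dot(query, h) for h in states])
    dim = len(states[0])
    context = [sum(w * h[i] for w, h in zip(weights, states)) for i in range(dim)]
    return context, weights


def multimodal_attention(decoder_state, encoder_states):
    """Fuse per-modality encoder states into one context vector.

    encoder_states maps a modality name to its list of hidden state
    vectors, one vector per time step (hypothetical layout).
    """
    names = sorted(encoder_states)
    # Step 1: temporal attention inside each modality.
    contexts = [attend(decoder_state, encoder_states[n])[0] for n in names]
    # Step 2: attention across modalities over the per-modality contexts.
    fused, modality_weights = attend(decoder_state, contexts)
    return fused, dict(zip(names, modality_weights))


if __name__ == "__main__":
    toy_states = {
        "rgb": [[1.0, 0.0], [0.0, 1.0]],
        "audio": [[0.5, 0.5], [0.5, 0.5]],
    }
    fused, weights = multimodal_attention([1.0, 0.0], toy_states)
    print(fused, weights)
```

In this sketch the decoder state acts as the query at both levels, so a modality whose summarized context aligns better with the current decoder state contributes more to the fused vector. A trained model would typically use learned projection matrices in the scoring function instead of a raw dot product.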