Deep learning has achieved great success in speech recognition and image recognition. Recently, the Recurrent Convolutional Network (RCN), which combines the merits of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), has been proposed to encode the spatio-temporal information contained in video. However, the RCN suffers from overfitting due to its large number of parameters and the lack of training data. In this paper, we first put forward a Shared GRU-RCN (SGRU-RCN) model that reduces the number of parameters by sharing the input-to-hidden parameters of the original GRU-RCN architecture; the resulting SGRU-RCN is therefore less prone to overfitting. We then propose a SeqVLAD model that integrates SGRU-RCN and the VLAD encoding method into a single framework. In particular, we utilize SGRU-RCN to learn the spatio-temporal assignments of the convolutional feature maps extracted from successive video frames. With the learned assignments, VLAD encoding can aggregate local descriptors that capture both the detailed spatial information within individual video frames and the fine motion information across successive frames. Furthermore, we conduct experiments on video action recognition to demonstrate the effectiveness and strong performance of our method. Finally, we conduct experiments on video captioning to illustrate the extensibility of the proposed method.
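The assignment-weighted aggregation described above can be sketched as follows. This is a minimal NumPy illustration of sequence-level VLAD pooling, not the authors' implementation: the function name and array layout are assumptions, and the soft assignments (which in SeqVLAD would be produced by the SGRU-RCN) are simply passed in as an input.

```python
import numpy as np

def seq_vlad_aggregate(frames_desc, assignments, centers):
    """Aggregate per-frame local descriptors into one VLAD vector.

    frames_desc: (T, N, D) array, N local descriptors of dim D per frame, T frames
    assignments: (T, N, K) array, soft assignment of each descriptor to K clusters
    centers:     (K, D) array of cluster centers
    Returns a flattened, L2-normalized VLAD vector of shape (K*D,).
    """
    T, N, D = frames_desc.shape
    K = centers.shape[0]
    vlad = np.zeros((K, D))
    for t in range(T):
        # Residuals of every descriptor to every center: (N, K, D)
        residuals = frames_desc[t][:, None, :] - centers[None, :, :]
        # Accumulate assignment-weighted residuals over the whole sequence
        vlad += np.einsum('nk,nkd->kd', assignments[t], residuals)
    # Intra-normalization per cluster, then global L2 normalization
    # (a common post-processing choice for VLAD)
    vlad /= (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

Because the residuals are summed over all frames before normalization, the resulting vector pools spatial detail within each frame and motion-induced changes across frames into a single fixed-length representation.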