
Video Analysis Based on Sequence Deep Learning: Modelling, Representation and Application

Posted on: 2018-08-10  Degree: Doctor  Type: Dissertation
Country: China  Candidate: X Shen  Full Text: PDF
GTID: 1318330515996023  Subject: Signal and Information Processing
Abstract/Summary:
Recently, with the ever-increasing use of smartphones and digital cameras, people are able to record videos anywhere and at any time. Video semantic analysis is required when we want to store, recognize, share, edit, or generate this large amount of video data. Meanwhile, deep learning has led to booming progress in computer vision since 2012, which makes the analysis of large-scale video data possible. Consequently, analyzing videos with deep learning is a natural choice. Generally, current deep-learning-based video semantic analysis systems consist of two steps: 1) use Convolutional Neural Networks (CNNs) to obtain a visual representation of each frame; 2) learn a fixed-length representation of the video based on these frames and then decode the semantic label or sentence with a Long Short-Term Memory (LSTM) network.

In this thesis, we first investigate and summarize related work on video semantic analysis in recent years, and then conduct research on video classification and video captioning. From the viewpoint of the visual representation of video frames, we propose continuous dropout, a parameter-based robust CNN, and an architecture-based robust CNN, aiming to improve the feature fusion and feature extraction of current CNNs. Moreover, to improve the performance and training efficiency of multi-layer LSTMs, we propose a greedy layer-wise unsupervised pre-training algorithm. In addition, to break the limitation of the current mapping-based sequence-to-sequence learning framework, we propose sequence-to-sequence learning based on a shared latent representation, which offers a new viewpoint on video and sentence analysis. The main contributions of this thesis are summarized as follows:

Continuous Dropout: Dropout has been proven to be an effective algorithm for training robust deep networks because of its ability to prevent overfitting by avoiding the co-adaptation of feature detectors. According to the activation patterns of neurons in the human brain, when faced with different situations the firing rates of neurons are random and continuous, not binary as in current dropout. Inspired by this phenomenon, we extend traditional binary dropout to continuous dropout. On the one hand, continuous dropout is considerably closer to the activation characteristics of neurons in the human brain than traditional binary dropout. On the other hand, we prove that continuous dropout retains the property of avoiding the co-adaptation of feature detectors, which suggests that we can extract more independent feature detectors for model averaging at the test stage. We introduce the proposed continuous dropout into a feedforward neural network and comprehensively compare it with binary dropout, adaptive dropout, and DropConnect on MNIST, CIFAR-10, SVHN, NORB, and ILSVRC-12. Thorough experiments demonstrate that our method performs better in preventing the co-adaptation of feature detectors and improves test performance.
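To make the idea concrete, here is a minimal sketch of continuous dropout in PyTorch (an illustrative choice of framework); the uniform mask distribution and the inverted rescaling are assumptions for illustration, not the exact formulation used in the thesis.

```python
import torch

def continuous_dropout(x: torch.Tensor, training: bool = True) -> torch.Tensor:
    # Binary dropout multiplies each activation by a Bernoulli mask.
    # Continuous dropout instead draws the mask from a continuous
    # distribution; here u ~ Uniform(0, 1), whose mean is 0.5.
    if not training:
        return x                  # test stage: use expected activations directly
    mask = torch.rand_like(x)     # continuous mask in (0, 1)
    return x * mask / 0.5         # "inverted" scaling keeps E[output] = x
```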
Parameter-based robust CNN: Convolutional Neural Networks (CNNs) have achieved state-of-the-art results on many visual recognition tasks. However, current CNN models still exhibit a poor ability to remain invariant to spatial transformations of images. Intuitively, with sufficient layers and parameters, hierarchical combinations of convolution (matrix multiplication and non-linear activation) and pooling operations should be able to learn a robust mapping from transformed input images to transform-invariant representations. We propose randomly transforming (rotating, scaling, and translating) the feature maps of CNNs during the training stage. This prevents CNN models from forming complex dependencies on specific rotation, scale, and translation levels of the training images. Instead, each convolutional kernel learns to detect a feature that is generally helpful for producing the transform-invariant answer given the combinatorially large variety of transform levels of its input feature maps. In this way, we require no extra training supervision and no modification to the optimization process or the training images. We show that random transformation provides significant improvements for CNNs on many benchmark tasks, including small-scale image recognition, large-scale image recognition, and image retrieval.
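A minimal sketch of such a random feature-map transformation, written with PyTorch's affine_grid/grid_sample; the transform ranges below are illustrative assumptions:

```python
import math
import torch
import torch.nn.functional as F

def random_transform(feat, max_deg=30.0, max_scale=0.2, max_shift=0.1):
    # Randomly rotate, scale, and translate a batch of feature maps
    # (N, C, H, W) so downstream kernels cannot co-adapt to one fixed
    # pose of their input.
    n = feat.size(0)
    ang = (torch.rand(n, device=feat.device) * 2 - 1) * math.radians(max_deg)
    scale = 1 + (torch.rand(n, device=feat.device) * 2 - 1) * max_scale
    tx = (torch.rand(n, device=feat.device) * 2 - 1) * max_shift
    ty = (torch.rand(n, device=feat.device) * 2 - 1) * max_shift
    cos, sin = torch.cos(ang) * scale, torch.sin(ang) * scale
    theta = torch.stack([
        torch.stack([cos, -sin, tx], dim=1),
        torch.stack([sin,  cos, ty], dim=1),
    ], dim=1)                                    # (N, 2, 3) affine matrices
    grid = F.affine_grid(theta, feat.size(), align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)
```

During training this module can be dropped between convolutional blocks; at test time it is simply skipped.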
Architecture-based robust CNN: CNNs have demonstrated state-of-the-art performance on many visual recognition tasks. However, the combination of convolution and pooling operations is invariant only to small local location changes of meaningful objects in the input. Such networks are sometimes trained with data augmentation to encode this invariance into the parameters, which restricts the capacity of the model to learn the content of these objects. A more efficient use of the parameter budget is to encode rotation or translation invariance into the model architecture, which relieves the model from the need to learn it. To enable the model to focus on learning the content of objects rather than their locations, we propose to rank the patches of the feature maps before feeding them into the next layer. When patch ranking is combined with convolution and pooling operations, we obtain consistent representations regardless of the location of meaningful objects in the input. We show that the patch ranking module improves the performance of CNNs on many benchmark tasks, including MNIST digit recognition, large-scale image recognition, and image retrieval.
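One way to realize such a module is sketched below; scoring patches by their mean activation and sorting is an assumption for illustration, since the abstract does not fix the ranking rule:

```python
import torch
import torch.nn.functional as F

def patch_rank(feat: torch.Tensor, patch: int = 2) -> torch.Tensor:
    # Split each feature map (N, C, H, W) into non-overlapping patches,
    # score every patch, and reorder the patches by score. The output
    # then reflects *which* patterns occur, not *where* they occur.
    cols = F.unfold(feat, kernel_size=patch, stride=patch)  # (N, C*p*p, L)
    scores = cols.mean(dim=1)                               # one score per patch
    order = scores.argsort(dim=1, descending=True)          # (N, L) ranking
    idx = order.unsqueeze(1).expand_as(cols)
    return torch.gather(cols, 2, idx)   # patches in rank order, location-free
```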
Greedy layer-wise training of multi-layer LSTMs: Recent developments in Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) have shown promising potential for modeling sequential data, especially in the fields of computer vision and natural language processing. Nevertheless, training LSTMs is not trivial when the deep architecture contains multiple layers. This difficulty originates from the initialization of the LSTM, where gradient-based optimization often converges to poor local solutions. We explore an unsupervised pre-training mechanism for LSTM initialization, following the philosophy that unsupervised pre-training acts as a regularizer to guide the subsequent supervised training. We propose a novel encoder-decoder-based learning framework that initializes a multi-layer LSTM in a greedy layer-wise manner, in which each added LSTM layer is trained to retain the main information of the previous representation. A multi-layer LSTM trained with our pre-training method outperforms one trained with random initialization, with clear advantages on several tasks such as regression (Adding), handwritten digit recognition (MNIST), video classification (UCF-101), and machine translation (WMT'14). Moreover, we show that multi-layer LSTMs converge 4 times faster with our greedy layer-wise training method.
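The following sketch shows one greedy pre-training step in PyTorch; the small reconstruction decoder and the hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_layer(new_lstm: nn.LSTM, inputs: torch.Tensor,
                   epochs: int = 5, lr: float = 1e-3) -> nn.LSTM:
    # Encoder-decoder pre-training of one added layer: the new LSTM layer
    # encodes the sequence produced by the layers below, and a throwaway
    # decoder must reconstruct that sequence, forcing the new layer to
    # retain the main information of the previous representation.
    decoder = nn.LSTM(new_lstm.hidden_size, inputs.size(-1), batch_first=True)
    params = list(new_lstm.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        encoded, _ = new_lstm(inputs)          # (B, T, hidden)
        recon, _ = decoder(encoded)            # reconstruct the layer's input
        loss = F.mse_loss(recon, inputs)
        opt.zero_grad(); loss.backward(); opt.step()
    return new_lstm                            # decoder is discarded

# Stack greedily: pre-train each new layer on the outputs of the layers
# below it, then fine-tune the whole stack with the supervised objective.
```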
Sequence-to-sequence learning via a shared latent representation: Sequence-to-sequence learning is a popular research area in deep learning, covering tasks such as video captioning and speech recognition. Existing methods model this learning as a mapping process: they first encode the input sequence into a fixed-size vector and then decode the target sequence from that vector. Although simple and intuitive, such a mapping model is task-specific and cannot be directly used for different tasks. We propose a star-like framework for general and flexible sequence-to-sequence learning, in which different types of media content (the peripheral nodes) can be encoded to and decoded from a shared latent representation (SLR, the central node). This is inspired by the fact that the human brain can learn and express an abstract concept in different ways. The media-invariant property of the SLR can be seen as a high-level regularization on the intermediate vector, enforcing it to capture not only the latent representation within each individual medium, like an auto-encoder, but also the transitions between media, like the mapping models. Moreover, the SLR model is content-specific rather than task-specific: it only needs to be trained once per dataset and can then be used for different tasks. We show how to train an SLR model via dropout and use it for different sequence-to-sequence tasks. Our SLR model is validated on the YouTube2Text and MSR-VTT datasets, achieving state-of-the-art performance on the video-to-sentence task and producing the first sentence-to-video results.
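A minimal sketch of the star-like layout; the two modalities, the LSTM branches, and all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SLR(nn.Module):
    # Every modality gets its own encoder to, and decoder from, one shared
    # latent representation, so any input branch can be paired with any
    # output branch (video-to-text, text-to-video, ...).
    def __init__(self, video_dim=2048, text_dim=300, latent=512):
        super().__init__()
        self.enc = nn.ModuleDict({
            'video': nn.LSTM(video_dim, latent, batch_first=True),
            'text':  nn.LSTM(text_dim,  latent, batch_first=True),
        })
        self.dec = nn.ModuleDict({
            'video': nn.LSTM(latent, video_dim, batch_first=True),
            'text':  nn.LSTM(latent, text_dim,  batch_first=True),
        })

    def forward(self, x, src, dst, out_len):
        _, (h, _) = self.enc[src](x)                  # encode to the SLR
        z = h[-1]                                     # (B, latent) shared vector
        z_seq = z.unsqueeze(1).repeat(1, out_len, 1)  # feed z at each decode step
        out, _ = self.dec[dst](z_seq)
        return out

# model = SLR()
# caption_feats = model(frame_feats, src='video', dst='text', out_len=20)
```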
Keywords/Search Tags: video semantic analysis, deep learning, dropout, convolutional neural networks, sequence learning