Since the third scientific and technological revolution, human society has entered an era of multi-source big data marked by an explosion of information and massive growth in data volume. Much of this data is sequential, that is, arranged in a particular order, such as text, speech, images, and video. For sequence modeling, data-driven dynamic models based on deep networks extract the statistical characteristics of the data and make predictions from its history. In practical applications, however, sequence modeling may suffer from high data dimensionality or a lack of supervision information. This makes it difficult for a simple network to extract the features of complex sequential signals and poses great challenges to the analysis and application of sequential data. This paper focuses on different types of sequential data, with particular attention to extracting their dynamic characteristics, and explores sequential models for a range of applications, including hyperspectral fusion, text generation, video restoration, and multimodal fusion. Furthermore, exploiting expert knowledge such as high-level semantic information, category information, and dynamic information may benefit sequence modeling. The main contributions of this paper are summarized as follows.

1. The recurrent neural network is an important and widely used sequential model, but it suffers from vanishing gradients when processing long, high-dimensional, and complex signals. For example, the spectrum of a hyperspectral pixel is sequential data with high dimensionality and complex feature mixtures, so constructing a suitable recurrent mechanism for feature extraction and generative modeling of spectral signals is quite challenging. To model the sequential characteristics of hyperspectral data, a variational probabilistic recurrent neural network is designed to extract the sequential features of spectral signals, and the high-resolution spatial features and high-resolution spectral features are fused
together. Finally, high-resolution hyperspectral images can be generated in an unsupervised manner. In addition, this paper explores the dynamic correlations in video reconstruction for video snapshot compressive imaging systems, where the frames of the video are generated one by one in a recurrent manner. Furthermore, the optical flow between every two adjacent frames is leveraged to help the dynamic model generate the video.

2. A recurrent neural network uses a single set of parameters to learn sequential features for all training and testing samples; such undifferentiated modeling needs a more specific design for highly diverse sample sets. For example, the spectral characteristics of hyperspectral images are complicated by varying illumination, atmospheric conditions, and material mixing. This paper therefore proposes a recurrent-neural-network-based mixture-of-experts model with nonparametric Bayesian clustering to extract features from hyperspectral pixels with complex heterogeneity. The heterogeneity between samples is introduced into the latent space of the samples as learnable prior knowledge, implemented in a probabilistic latent space under the variational framework.

3. For long sequential signals, the recurrent neural network has practical drawbacks: its memory is short and it cannot model long-distance correlations, and its step-by-step processing over time prevents parallel training and slows training down. This paper therefore explores Transformer models based on the attention mechanism, which can capture long-distance feature relationships and be trained in parallel. The attention mechanism directly models the relationship between the signals at any two positions in the sequence and learns the corresponding feature representation regardless of their distance in the sequence. The Transformer model is built entirely on the
attention mechanism and aims to model the global dependencies of the signal at every position, which removes the limitation that a recurrent neural network cannot be trained in parallel. In addition, a deep probabilistic topic model can extract hierarchical semantic topic information, and because the semantic topics within a document are consistent, this information is helpful for language generation. Building on the Transformer, this paper therefore uses topic information to guide the generation of semantically consistent text, applies it to two tasks, namely text generation and image captioning, and designs a corresponding fast inference method to learn cross-modal semantic information.

4. An obvious drawback of the Transformer model is that the computational cost of self-attention grows quadratically with the input length. When modeling long text, high-resolution images, or videos, the input length of the model must therefore be limited and performance degrades. How to control the computational cost while learning long-range relationships among many features remains an open problem. This paper introduces the idea of graph attention and constructs a dynamic graph model for video sequence data, which controls the computational cost while capturing long-term dependencies between frames. The dynamic graph model lets each node select its own neighbors and compute the correlations with their features to update its latent representation in the graph network. The dynamic graph model is applied to video snapshot compressive imaging and reconstruction, where dynamic sparsity is introduced into the graph attention mechanism and video optical flow information is integrated to help learn the dynamic relationships between video frames.
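The variational recurrent modeling of spectra in item 1 can be illustrated with a minimal sketch. It assumes a generic variational RNN: at each step, a prior network conditioned on the previous hidden state and a posterior network conditioned on the current input and that state each output a diagonal Gaussian over the latent variable, and the sampled latent then enters the recurrence. The class `TinyVRNN`, its random untrained weights, and the grouping of a 128-band spectrum into 16 steps of 8 bands are hypothetical choices for illustration, not the model of this paper; the decoder and reconstruction term of a full variational objective are omitted.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )


class TinyVRNN:
    """Hypothetical single-layer variational RNN with random (untrained) weights."""

    def __init__(self, x_dim, z_dim, h_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        self.z_dim, self.h_dim = z_dim, h_dim
        # Prior network: h_{t-1} -> (mu, logvar) of p(z_t | h_{t-1})
        self.W_prior = rng.normal(0, s, (2 * z_dim, h_dim))
        # Posterior network: [x_t, h_{t-1}] -> (mu, logvar) of q(z_t | x_t, h_{t-1})
        self.W_post = rng.normal(0, s, (2 * z_dim, x_dim + h_dim))
        # Recurrence: [x_t, z_t, h_{t-1}] -> h_t
        self.W_h = rng.normal(0, s, (h_dim, x_dim + z_dim + h_dim))
        self.rng = rng

    def run(self, xs):
        """Encode xs of shape (T, x_dim); return latents (T, z_dim) and summed KL."""
        h = np.zeros(self.h_dim)
        zs, kl_total = [], 0.0
        for x in xs:
            mu_p, logvar_p = np.split(self.W_prior @ h, 2)
            mu_q, logvar_q = np.split(self.W_post @ np.concatenate([x, h]), 2)
            eps = self.rng.standard_normal(self.z_dim)
            z = mu_q + np.exp(0.5 * logvar_q) * eps  # reparameterization trick
            kl_total += gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
            h = np.tanh(self.W_h @ np.concatenate([x, z, h]))
            zs.append(z)
        return np.stack(zs), kl_total


# Usage: treat one pixel's 128-band spectrum as 16 steps of 8 adjacent bands.
spectrum = np.random.default_rng(1).random(128)
model = TinyVRNN(x_dim=8, z_dim=4, h_dim=16)
zs, kl = model.run(spectrum.reshape(16, 8))
print(zs.shape)  # (16, 4)
```

Conditioning the prior on the hidden state is what makes the latent space dynamic: the KL term pulls each step's posterior toward a prior that depends on the bands already seen.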
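Item 2 relies on a nonparametric Bayesian prior over clusters of heterogeneous samples. One common way to realize such a prior is the truncated stick-breaking construction of a Dirichlet process, sketched below under that assumption and combined with a weighted mixture of experts. The function names, the truncation level `K`, the concentration `alpha`, and the use of linear experts are hypothetical; the model in this paper infers per-sample cluster assignments in a variational latent space, which this sketch does not attempt.

```python
import numpy as np


def stick_breaking(v):
    """Map stick proportions v_k in (0, 1] to mixture weights pi_k.

    pi_k = v_k * prod_{j<k} (1 - v_j). Setting the last v_K = 1 truncates the
    process at K components so that the weights sum to exactly 1.
    """
    v = np.asarray(v, dtype=float)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining


def sample_truncated_dp_weights(alpha, K, rng):
    """Draw truncated Dirichlet-process mixture weights with v_k ~ Beta(1, alpha)."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # truncation at K components
    return stick_breaking(v)


def mixture_of_experts(h, expert_weights, pi):
    """Combine K linear experts applied to a feature vector h using weights pi.

    h: (d,) features, e.g. an RNN hidden state summarizing one pixel's spectrum.
    expert_weights: (K, out, d), one weight matrix per expert.
    pi: (K,) mixture weights (a shared prior here; not per-sample posteriors).
    """
    outputs = np.einsum("kod,d->ko", expert_weights, h)  # (K, out)
    return pi @ outputs                                  # (out,)


# Usage with hypothetical sizes: 10 experts, 16-dim features, 3-dim outputs.
rng = np.random.default_rng(0)
pi = sample_truncated_dp_weights(alpha=2.0, K=10, rng=rng)
experts = rng.normal(size=(10, 3, 16))
y = mixture_of_experts(rng.normal(size=16), experts, pi)
print(y.shape)  # (3,)
```

A small `alpha` concentrates mass on the first few sticks (few active experts), while a large `alpha` spreads it over many, which is how the prior lets the data decide how many clusters are effectively used.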
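The distance-independent pairwise modeling in item 3, and the quadratic cost noted in item 4, both follow from the form of scaled dot-product self-attention. The single-head sketch below (random projection matrices, no masking, no multi-head split) is a generic illustration rather than the topic-guided Transformer of this paper; it makes explicit that the score matrix has one entry per pair of positions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (n, d) sequence of n feature vectors. Every position attends to every
    other position, so the score matrix is (n, n): memory and compute grow
    quadratically with n, independently of how far apart two positions are.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n, n) pairwise scores
    A = softmax(scores, axis=-1)             # each row sums to 1
    return A @ V, A


# Usage with a toy sequence of 6 positions and 8-dim features.
rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
print(out.shape, A.shape)  # (6, 8) (6, 6)

# Number of score entries per layer and head: doubling the length quadruples it.
for length in (1024, 2048, 4096):
    print(length, length * length)
```

Because every row of `A` covers all n positions, two positions interact in a single step however far apart they are, and all rows can be computed in parallel; this is the property item 3 contrasts with step-by-step recurrence.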
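The dynamic graph model of item 4 lets each node choose its own neighbors before attending. A minimal sketch, assuming dot-product similarity and a fixed neighbor count `k` (both hypothetical choices, not taken from this paper), restricts each node's softmax and aggregation to its top-k neighbors and applies a residual update. The function name `dynamic_topk_graph_attention` is illustrative only.

```python
import numpy as np


def dynamic_topk_graph_attention(H, k):
    """One round of attention over a dynamically built k-nearest-neighbor graph.

    H: (n, d) node features, e.g. one feature vector per video frame or patch.
    Each node selects its own k neighbors (highest scaled dot-product
    similarity, excluding itself) and aggregates only their features, so the
    attention weights and aggregation cost O(n * k) instead of O(n * n).
    For clarity this sketch still forms the full (n, n) similarity matrix to
    pick neighbors; a practical model would restrict the candidate set.
    """
    n, d = H.shape
    assert 1 <= k <= n - 1, "k must leave room to exclude the node itself"
    sim = H @ H.T / np.sqrt(d)
    np.fill_diagonal(sim, -np.inf)                     # no self-loops
    nbr = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # (n, k) neighbor indices
    nbr_scores = np.take_along_axis(sim, nbr, axis=1)  # (n, k) selected scores
    nbr_scores -= nbr_scores.max(axis=1, keepdims=True)
    w = np.exp(nbr_scores)
    w /= w.sum(axis=1, keepdims=True)                  # sparse attention weights
    messages = np.einsum("nk,nkd->nd", w, H[nbr])      # weighted neighbor sum
    return H + messages, nbr                           # residual update


# Usage: 8 hypothetical frame features of dimension 16, 3 neighbors each.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
updated, nbr = dynamic_topk_graph_attention(frames, k=3)
print(updated.shape, nbr.shape)  # (8, 16) (8, 3)
```

Since neighbors are recomputed from the current features, the graph changes from layer to layer and from video to video, which is one reading of "dynamic" sparsity.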
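Items 1 and 4 both use optical flow between adjacent frames. A standard way to exploit a flow field is to backward-warp the neighboring frame into alignment with the current one; the sketch below assumes a dense flow in pixel units with channel 0 horizontal and channel 1 vertical, bilinear sampling, and clamping at the image border. The function `backward_warp` is hypothetical and does not reproduce how this paper integrates flow into its models.

```python
import numpy as np


def backward_warp(frame, flow):
    """Warp a grayscale frame (H, W) with a dense flow field (H, W, 2).

    Output pixel (y, x) samples the input at (y + flow[y, x, 1], x + flow[y, x, 0])
    using bilinear interpolation; sample coordinates are clamped to the border.
    """
    H, W = frame.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sx = np.clip(xs + flow[..., 0], 0, W - 1)
    sy = np.clip(ys + flow[..., 1], 0, H - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0  # fractional offsets used as interpolation weights
    top = (1 - wx) * frame[y0, x0] + wx * frame[y0, x1]
    bottom = (1 - wx) * frame[y1, x0] + wx * frame[y1, x1]
    return (1 - wy) * top + wy * bottom


# Usage: a flow of +1 pixel horizontally samples each pixel's right neighbor.
frame = np.arange(20, dtype=float).reshape(4, 5)
flow = np.zeros((4, 5, 2))
flow[..., 0] = 1.0
warped = backward_warp(frame, flow)
print(np.array_equal(warped[:, :-1], frame[:, 1:]))  # True
```

Feeding the warped previous frame (or warped features) to the model alongside the current measurement gives it motion-compensated context, which is the general role flow plays in recurrent or graph-based video reconstruction.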