Recently, image semantic caption generation has received increasing attention as a fundamental research problem in artificial intelligence. This technique serves as a bridge connecting image processing in computer vision with sequence generation in natural language processing. Automatically generating descriptions of images is very useful in practice; for example, it can help visually impaired people understand image contents and can improve image retrieval quality by discovering salient content. Much progress has been made in image captioning, and the encoder-decoder framework has achieved outstanding performance on this task. In this paper, we propose a novel architecture, the Auto-Reconstructor Network (ARNet), which, coupled with the conventional encoder-decoder framework, works in an end-to-end fashion to generate captions. ARNet aims to reconstruct the previous hidden state from the current one in recurrent neural networks (RNNs), in addition to acting as the information transition operator. ARNet therefore encourages the current hidden state to embed more information from the previous one and exploits the deeper relationship between them, which helps regularize the transition dynamics of RNNs. Extensive experimental results show that our proposed ARNet boosts the performance of existing encoder-decoder models on the image semantic captioning task. Additionally, we quantitatively evaluate the discrepancy between the training and inference processes for caption generation and demonstrate that ARNet markedly reduces this discrepancy. Furthermore, performance on permuted sequential MNIST demonstrates that ARNet can effectively regularize RNNs, especially for modeling long-term dependencies.
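The core idea above, reconstructing the previous hidden state from the current one and using the reconstruction error as a regularizer, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the random stand-in hidden states, and the single linear map `W` (the paper couples the reconstructor with the recurrent decoder) are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; in the paper these are LSTM hidden states
# produced by the caption decoder).
hidden_size, steps = 8, 5

# A sequence of hidden states h_1 .. h_T (random stand-ins here).
h = rng.standard_normal((steps, hidden_size))

# ARNet-style reconstructor: map the current state h_t back toward the
# previous state h_{t-1}. A single linear map is a simplification of the
# recurrent reconstructor described in the paper.
W = rng.standard_normal((hidden_size, hidden_size)) * 0.1

def arnet_loss(h, W):
    """Mean squared error between reconstructed and true previous states.

    This auxiliary loss would be added to the usual captioning
    (cross-entropy) loss during end-to-end training.
    """
    recon = h[1:] @ W          # reconstructed \hat{h}_{t-1} from h_t
    diff = recon - h[:-1]      # compare against the true h_{t-1}
    return float(np.mean(diff ** 2))

print(arnet_loss(h, W))
```

Minimizing this term alongside the captioning loss pushes each hidden state to retain information about its predecessor, which is the regularizing effect described above.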