With the rapid development of modern information technology, people's lifestyles have become increasingly digital, networked, and intelligent. Users share large numbers of pictures on the Internet through mobile phones, tablets, computers, and other devices, making images an important carrier of online information. These images contain numerous instances of natural scenes. Faced with such valuable image data, people hope that computers can automatically analyze and understand images of natural scenes, and then describe them meaningfully and logically. Since image captioning for natural scenes has broad applications in intelligent annotation, human-computer interaction, disability services, and education, it has become a research hotspot pursued by many scholars and institutions worldwide. The essence of image captioning is to enable a computer to accurately detect and recognize the objects in a given image, combine this with scene information to understand the image content, and finally produce a description that humans can understand. Compared with image classification, object detection, image segmentation, and related fields, image captioning involves both image processing and language modeling, so it must organically integrate computer vision and natural language processing.

The main work of this thesis is as follows. An 'encoder-decoder' framework was introduced from the field of machine translation to construct an end-to-end image caption model, in which the encoder performs feature extraction and the decoder generates the textual description. Since the feature extraction ability of convolutional neural networks in image processing has been well verified in recent years, a sophisticated convolutional neural network module was designed so that the encoder can extract visual features with strong expressive power. A decoder based on a long short-term memory network with an attention mechanism was then introduced, so that the image caption model can selectively focus on specific image regions while generating the corresponding words (a minimal sketch of this architecture is given below).

The 'encoder-decoder' image caption model based on deep networks is quite capable of accomplishing the captioning task. However, deep models are generally large, and their high time and space complexity restricts deployment in a wider range of settings. This thesis therefore studies a hybrid optimization method based on network pruning and tensor decomposition for the basic components of the 'encoder-decoder' model, aiming to reduce its time and space complexity. First, a data-driven, globally supervised iterative method is used to decompose the convolutional layers; then the convolution kernels and neurons are ranked according to the importance evaluation criteria proposed in this thesis, and the relatively unimportant kernels or neurons are pruned (a simplified pruning sketch is given below).

Finally, the proposed methods were evaluated on public datasets such as MS COCO and Flickr. The experimental results show that the above method can quickly and accurately understand a given image and produce a logical description consistent with human habits, and comparative analysis with similar methods further confirms its effectiveness. Additional experiments show that the proposed optimization method effectively reduces the time and space complexity of the deep model.
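The following is a minimal sketch of the 'encoder-decoder' captioning architecture described above: a CNN encoder that extracts a grid of visual features and an LSTM decoder with additive attention over those features. The backbone choice (a torchvision ResNet), layer sizes, and vocabulary size are illustrative assumptions, not the exact configuration used in the thesis.

import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """CNN encoder: maps an image to a set of spatial feature vectors."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)        # backbone is an assumption
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):                        # images: (B, 3, H, W)
        feats = self.backbone(images)                 # (B, C, h, w)
        B, C, h, w = feats.shape
        return feats.view(B, C, h * w).permute(0, 2, 1)   # (B, h*w, C)


class AttentionDecoder(nn.Module):
    """LSTM decoder that attends over the encoder features at every step."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):               # feats: (B, N, feat_dim)
        B = feats.size(0)
        h = feats.new_zeros(B, self.hidden_dim)
        c = feats.new_zeros(B, self.hidden_dim)
        emb = self.embed(captions)                    # (B, T, embed_dim)
        outputs = []
        for t in range(captions.size(1)):
            # Additive attention: score each region against the hidden state.
            scores = self.att_out(torch.tanh(
                self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))  # (B, N, 1)
            alpha = torch.softmax(scores, dim=1)
            context = (alpha * feats).sum(dim=1)      # (B, feat_dim)
            h, c = self.lstm(torch.cat([emb[:, t], context], dim=1), (h, c))
            outputs.append(self.fc(h))
        return torch.stack(outputs, dim=1)            # (B, T, vocab_size)


if __name__ == "__main__":
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, 1000, (2, 12))        # toy token ids
    feats = Encoder()(images)
    logits = AttentionDecoder(vocab_size=1000)(feats, captions)
    print(logits.shape)                               # torch.Size([2, 12, 1000])

In practice the decoder is trained with teacher forcing on a cross-entropy loss over the predicted word distributions; the attention weights alpha indicate which image regions the model focuses on when emitting each word.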
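The sketch below illustrates the general idea of filter pruning: ranking convolution kernels by an importance score and keeping only the highest-scoring ones. The L1-norm score used here is a common stand-in and is only an assumption; the thesis proposes its own importance evaluation criteria and combines pruning with a data-driven decomposition of the convolutional layers, which this sketch does not reproduce.

import torch
import torch.nn as nn


def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Return a new Conv2d keeping only the highest-scoring output filters."""
    # Score each output filter by the L1 norm of its weights (assumed criterion).
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))     # (out_channels,)
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep_idx = torch.argsort(scores, descending=True)[:n_keep]

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep_idx])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep_idx])
    return pruned


if __name__ == "__main__":
    conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
    smaller = prune_conv_filters(conv, keep_ratio=0.25)
    x = torch.randn(1, 64, 32, 32)
    print(conv(x).shape, smaller(x).shape)   # 128 vs. 32 output channels

Note that pruning the output filters of one layer also shrinks the input channels expected by the next layer, so in a full pipeline the downstream layers must be adjusted accordingly and the network is usually fine-tuned afterwards to recover accuracy.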