
Image Caption Model Based On Feature Extraction Via Dense Convolutional Neural Network

Posted on: 2020-10-30    Degree: Master    Type: Thesis
Country: China    Candidate: Y L Hao    Full Text: PDF
GTID: 2428330575956430    Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, the growing computational power of the Graphics Processing Unit (GPU) has driven the rise of artificial intelligence and deep learning. Computer vision, augmented reality, natural language processing, virtual reality, speech recognition, and many other fields have in turn profoundly affected people's lives. In computer vision, the emergence of datasets such as ImageNet, COCO, and VOC, together with classical convolutional neural network models (AlexNet, VGG, ResNet, Inception, DenseNet), has greatly advanced research on tasks such as image classification, object detection, object tracking, semantic segmentation, and image captioning. In natural language processing, the introduction of the Encoder-Decoder model, the Seq2Seq model, and the attention mechanism has spurred progress in machine translation, text mining, sentiment analysis, and question answering. In speech recognition, topics such as speech feature extraction and text pattern matching have likewise been studied extensively.

Image captioning is a popular research topic in artificial intelligence. It combines the two major fields of computer vision and natural language processing, and it has wide applications, such as image translation, image retrieval, and early childhood education. In this thesis, image feature maps are extracted with a dense convolutional neural network, and the description text for an image is generated by effectively combining a "visual attention switch" with a long short-term memory network (LSTM).

The thesis first introduces basic concepts of image captioning, including the mainstream model framework, the criteria for evaluating model performance, and the current state of research. On this basis, it summarizes feature-map extraction with recently proposed convolutional neural networks. Next, we present our method for converting an image into a corresponding Chinese description. The approach follows the basic Encoder-Decoder architecture: the encoder extracts the feature map of the input image with the recently proposed densely connected convolutional network (DenseNet), while the decoder uses an LSTM to parse the feature map produced by the encoder into a description sentence. At each time step, the feature maps extracted by the encoder are combined with the word-embedding vector of the current input word to predict the word at the next time step. Finally, we describe the experimental details and results and compare the performance of the proposed model with that of other models.
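The core decoding step described above — weighting the encoder's spatial feature map by the decoder's current state before predicting the next word — can be sketched as soft (additive) attention. The following is a minimal NumPy illustration, not the thesis's actual implementation: the shapes, projection matrices, and the additive scoring form are assumptions chosen for clarity, and in a real model W_f, W_h, and w_a would be learned jointly with the DenseNet encoder and the LSTM decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: a 7x7 DenseNet feature map flattened to L = 49
# spatial locations with D = 512 channels, and an LSTM hidden size H = 256.
L, D, H = 49, 512, 256

# Illustrative (random) projection matrices; in practice these are learned.
W_f = rng.normal(0, 0.01, (D, H))   # projects each feature-map location
W_h = rng.normal(0, 0.01, (H, H))   # projects the LSTM hidden state
w_a = rng.normal(0, 0.01, (H,))     # scores each projected location

def soft_attention(features, hidden):
    """Additive attention over spatial locations.

    features: (L, D) flattened feature map; hidden: (H,) LSTM state.
    Returns the context vector (D,) and the attention weights (L,).
    """
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w_a   # (L,)
    scores -= scores.max()                                   # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()            # softmax over locations
    context = alpha @ features                               # weighted sum: (D,)
    return context, alpha

features = rng.normal(size=(L, D))
hidden = rng.normal(size=(H,))
context, alpha = soft_attention(features, hidden)
```

At each decoding step, a context vector like this would be combined with the word embedding of the current word and fed to the LSTM, whose output is projected to vocabulary logits to predict the next word.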
Keywords/Search Tags:image caption, visual attention switch, encoder-decoder, densenet, long short-term memory