| Image description with deep learning (image captioning) is a computer vision task in which deep learning algorithms generate natural language descriptions of images. It has practical value in areas such as search, subtitling, and automatic image annotation. In recent years, advances in attention mechanisms, Transformer models, and other deep learning techniques have driven substantial progress in this field, and such models have achieved state-of-the-art results on benchmark datasets such as MSCOCO and Flickr30k. Significant challenges remain, however: the diversity and coherence of the generated language need to improve, rare and novel concepts must be handled, and models need to become more efficient. Overall, image description is an exciting and rapidly developing area with many practical applications and research opportunities. Most current research builds on the encoder-decoder structure, and an attention mechanism can be introduced to improve description accuracy. In this study, the encoder is built on VGG19 and ResNet101, and the decoder uses a long short-term memory (LSTM) network together with a newly proposed improved attention mechanism that strengthens the correlation between images and words before the natural language description is output. Comparative experiments were conducted on two public datasets, Flickr8k and MSCOCO, and the model was evaluated with several metrics, including BLEU, METEOR, and CIDEr. The experimental results show that the model based on the improved attention mechanism performs better on the image description task, with significantly higher accuracy than the traditional attention model. To further improve accuracy and the evaluation metrics, this study also proposes DAA-Net (Dual Attention-Aware Network), an encoder-decoder model with a dual attention mechanism. The encoder uses the deep residual network ResNet101, a multi-layer convolutional neural network (CNN) that automatically extracts image features, with the aim of improving description quality and bringing the evaluation metrics to a human level. To alleviate the inconsistency between decoder inputs during training and testing, this study proposes a sampling strategy in which words are sampled in a certain proportion and used as the decoder input; the experimental results show that this strategy alleviates the input inconsistency to some extent. To handle the semantic information in natural language sentences, a recurrent neural network (RNN), specifically an LSTM, is used in the decoder, which improves description accuracy. In the DAA module, the improved attention mechanism reduces the number of model parameters and better models the relationships between different objects in the image, so that the decoder obtains more accurate attention regions. Layer normalization is introduced to shorten training time. The attention mechanism measures the correlation between the query and the candidate results, which strengthens the correlation between images and words in the generated description. Together, the encoder-decoder architecture and the attention mechanism allow images to be described more accurately, diversely, and coherently. Comparative experiments on the public datasets MSCOCO and Flickr30k show that, compared with the traditional general attention model, the improved model converges faster, scores higher on the relevant evaluation metrics, and generates more accurate image descriptions. |
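
The sampling strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the words are sampled, so the code below assumes a scheme in the spirit of scheduled sampling: at each position, the decoder input is the model's own predicted word with some probability and the ground-truth word otherwise. The function name `mix_decoder_inputs` and the parameter `sample_prob` are hypothetical and are not taken from the study.

```python
import random


def mix_decoder_inputs(ground_truth, predicted, sample_prob, rng=None):
    """Return decoder inputs that mix ground-truth and predicted words.

    Assumption (not stated in the abstract): each position independently
    uses the model's predicted word with probability `sample_prob`,
    and the ground-truth word otherwise.
    """
    if len(ground_truth) != len(predicted):
        raise ValueError("ground_truth and predicted must have the same length")
    if not 0.0 <= sample_prob <= 1.0:
        raise ValueError("sample_prob must lie in [0, 1]")
    rng = rng or random.Random()
    # rng.random() lies in [0, 1), so sample_prob=0 keeps every ground-truth
    # word and sample_prob=1 replaces every word with the prediction.
    return [
        pred if rng.random() < sample_prob else truth
        for truth, pred in zip(ground_truth, predicted)
    ]


if __name__ == "__main__":
    truth = ["a", "dog", "runs", "on", "grass"]
    guess = ["a", "cat", "runs", "in", "grass"]
    # A fixed seed makes the mixed sequence reproducible between runs.
    print(mix_decoder_inputs(truth, guess, 0.25, random.Random(0)))
```

With `sample_prob` set to 0 this reduces to ordinary teacher forcing. Raising it exposes the decoder during training to inputs closer to what it sees at test time, which is the inconsistency the abstract says the strategy is meant to reduce.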