In recent years, with the rapid development of technology, multimedia data on the Internet has grown explosively. The cross-modal field has become a hot research topic because it involves multi-modal data such as images and text. Combining deep learning techniques, this dissertation explores image captioning and cross-modal retrieval. The main work is as follows.

1. The influence of CNN features, text features, and RNN parameters on the experimental results is explored. An image captioning model based on a CNN and an LSTM network is proposed by changing the way the image is fed into the LSTM, and the validity of the model is verified by experiments (a minimal sketch of this design appears after this list).

2. A multi-instance learning model is used to extract text labels from the image, which are then fed in as attention information. We build an image captioning model that combines an attention mechanism, a CNN, and an LSTM network, improving on the performance of the first model (see the attention sketch after this list).

3. A cross-modal retrieval system based on deep learning models is implemented. We use VLAD encoding to represent the text and apply multiple deep networks to extract image features; DCCA-PHS is then used to maximize the correlation between data of different modalities (a VLAD encoding sketch follows the other examples below). Experiments show that, compared with traditional feature representation methods, the system achieves a significant improvement on cross-modal retrieval datasets.
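As an illustration of the first contribution, the following is a minimal sketch of a CNN-plus-LSTM captioning model in which the projected image feature is fed to the LSTM only at the first time step, one possible "way of inputting the image". The ResNet-50 encoder and all dimensions are assumptions; the dissertation does not specify them.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    """CNN encoder + LSTM decoder; the image feature is fed only at the
    first time step (one possible way of inputting the image to the LSTM)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights=None)            # encoder choice is an assumption
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.img_proj = nn.Linear(2048, embed_dim)     # project CNN feature to word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feat = self.encoder(images).flatten(1)         # (B, 2048) global image feature
        feat = self.img_proj(feat).unsqueeze(1)        # (B, 1, E)
        words = self.embed(captions)                   # (B, T, E)
        inputs = torch.cat([feat, words], dim=1)       # image acts as step 0 of the sequence
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                         # (B, T+1, vocab) word logits
```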
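For the second contribution, the sketch below shows one standard form of soft attention over the embeddings of text labels produced by a multi-instance learning detector. The additive (Bahdanau-style) scoring function is an assumption; the dissertation's exact attention formulation is not given here.

```python
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    """Soft attention over text-label embeddings from a multi-instance
    learning detector (a sketch; the scoring function is assumed)."""
    def __init__(self, embed_dim=256, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.w_label = nn.Linear(embed_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, label_embeds, hidden):
        # label_embeds: (B, K, E) embeddings of the K detected labels
        # hidden:       (B, H)    current LSTM hidden state
        scores = self.v(torch.tanh(
            self.w_label(label_embeds) + self.w_hidden(hidden).unsqueeze(1)
        ))                                             # (B, K, 1) additive scores
        alpha = torch.softmax(scores, dim=1)           # attention weights over labels
        context = (alpha * label_embeds).sum(dim=1)    # (B, E) weighted label context
        return context, alpha.squeeze(-1)
```

The context vector would be concatenated with the word embedding at each decoding step of the captioning LSTM above.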
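For the third contribution, the following sketches VLAD encoding of a text from its word vectors: each word vector is assigned to its nearest codebook center and the residuals are accumulated per center. The codebook size and the use of scikit-learn KMeans are assumptions; DCCA-PHS itself is the dissertation's correlation-maximization step and is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(word_vectors, kmeans):
    """VLAD encoding of one text: accumulate residuals between each word
    vector and its nearest cluster center, then L2-normalise."""
    centers = kmeans.cluster_centers_                  # (K, D) codebook
    assign = kmeans.predict(word_vectors)              # nearest center per word
    vlad = np.zeros_like(centers)
    for vec, k in zip(word_vectors, assign):
        vlad[k] += vec - centers[k]                    # residual accumulation
    vlad = vlad.flatten()                              # (K * D,) text descriptor
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Usage (codebook size 64 is an assumption):
# kmeans = KMeans(n_clusters=64).fit(all_word_vectors)
# text_feature = vlad_encode(doc_word_vectors, kmeans)
```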