
Research On Visual Semantic Embedding Based On Deep Learning And Word Embedding

Posted on: 2020-10-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z B Yang
Full Text: PDF
GTID: 2428330623461017
Subject: Computer application technology

Abstract/Summary:
With the development of the information age, the carriers of information are becoming increasingly multimodal, and Multimodal Representation Learning has gradually become a focus of research. In Computer Vision, the error rate of visual recognition systems on image classification has dropped below the average human level; in Natural Language Processing, machine translation is now good enough for routine human use. However, in Image Description (Captioning), which must make use of both images and text, computer performance remains unsatisfactory. How to effectively combine the strengths of Computer Vision and Natural Language Processing, and exploit the complementarity between images and text, has therefore become a research hotspot in recent years. Visual Semantic Embedding studies how to use this complementarity, eliminate the redundancy between the two modalities, and obtain better image and text representations.

With the rise of Deep Learning, both Computer Vision and Natural Language Processing have entered the deep learning stage. Convolutional Neural Networks have become the preferred models for image-related problems, while the development of Recurrent Neural Networks and Word Embedding models has brought applications such as text classification and machine translation closer to human levels. This thesis studies how to effectively use Deep Learning and Word Embedding models to improve Visual Semantic Embedding models, obtaining better image and text representations, as well as the real semantic structure within them, so as to raise the level of Image Description and other areas where images and text intersect. We exploit the respective strengths of Convolutional Neural Networks, Recurrent Neural Networks, and Word Embedding models to study Visual Semantic Embedding. The contributions of this thesis mainly include the following two aspects:

1) We propose a Visual Semantic Embedding learning framework based on Word Embedding Averaging. Through the joint learning of a Convolutional Neural Network and Word Embedding Averaging, the framework unifies the representation spaces of image and text into a common embedding space. On the image side, we use a Convolutional Neural Network to extract image features. On the text side, we embed each word to obtain its vector representation, then take the element-wise average of these vectors as the text feature. Finally, the gap between image and text features is reduced by a Triplet Ranking Loss with Hard Negative Mining. By applying transfer learning to image similarity detection, experiments show that our model extracts correct semantic features from images and produces similar vector representations for similar images. We also analyze the influence of Hard Negative Mining and of different CNN architectures on the model's performance.

2) We introduce Word Embedding Initialization and Text Data Augmentation into a Visual Semantic Embedding learning framework based on Recurrent Neural Networks, achieving better common representation learning across the two modalities. On the image side, we apply the widely used and effective Convolutional Neural Networks. On the text side, we apply Recurrent Neural Networks, which are well suited to sequential data, and use Word Embedding models to initialize the text encoder. We compare performance with and without Text Data Augmentation. For the loss function, we again choose the Triplet Ranking Loss with Hard Negative Mining. In the experiments, we use transfer learning to perform simple arithmetic operations between image vectors and word vectors on a small dataset, which shows that our model learns semantic features from images well. Compared with six other models, the experiments demonstrate that our proposed framework performs better on tasks such as Image Annotation and Image Search. We further analyze how the percentage of the training set used in model learning and Word Embedding Initialization affect the model. These experiments also show that Text Data Augmentation is more suitable for small datasets, while Word Embedding Initialization suits larger ones.

In conclusion, based on Deep Learning methods and Word Embedding models, this thesis makes full use of the feature expression ability of the learning framework to study Visual Semantic Embedding, and uncovers the latent semantic structure of the image and text spaces. Extensive experiments and applications demonstrate the effectiveness of our learning framework in Image Annotation, Image Search, and Image Similarity Detection.
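The two core text-side ideas described above, averaging word embeddings into a sentence feature and training with a triplet ranking loss that mines the hardest negative in each batch, can be sketched as follows. This is a minimal numpy illustration of the general technique, not the thesis's actual implementation; the function names, the 0.2 margin, and the use of cosine similarity over L2-normalized vectors are assumptions.

```python
import numpy as np

def text_feature(word_ids, embedding):
    """Word Embedding Averaging: a sentence feature is the element-wise
    average of its word vectors, L2-normalized for cosine similarity."""
    vecs = embedding[word_ids]              # (n_words, d)
    v = vecs.mean(axis=0)                   # element-wise average
    return v / np.linalg.norm(v)

def triplet_ranking_loss_hard(img, txt, margin=0.2):
    """Triplet ranking loss with hard negative mining.

    img, txt: (batch, d) L2-normalized embeddings; row i of each is a
    matched image-caption pair, and every other row in the batch serves
    as a candidate negative. Only the hardest (most similar) negative
    per anchor contributes to the hinge loss.
    """
    sims = img @ txt.T                      # cosine similarity matrix
    pos = np.diag(sims)                     # similarity of matched pairs
    # push the diagonal far down so positives are never picked as negatives
    masked = sims - np.eye(len(sims)) * 1e9
    hard_for_img = masked.max(axis=1)       # hardest caption per image
    hard_for_txt = masked.max(axis=0)       # hardest image per caption
    loss = (np.maximum(0.0, margin + hard_for_img - pos).sum()
            + np.maximum(0.0, margin + hard_for_txt - pos).sum())
    return loss
```

When matched pairs are already separated from all negatives by more than the margin, the loss is zero; otherwise only the single hardest negative per anchor produces a gradient, which is what distinguishes hard negative mining from summing over all negatives.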
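The Word Embedding Initialization step of the second framework, seeding the text encoder's embedding layer with pretrained word vectors before training, can be illustrated with the following sketch. This is a generic pattern under assumed names (`init_embedding`, the (-0.1, 0.1) uniform fallback for out-of-vocabulary words), not the thesis's exact procedure.

```python
import numpy as np

def init_embedding(vocab, pretrained, dim, seed=0):
    """Build the text encoder's embedding matrix.

    Rows for words found in the pretrained Word Embedding model are
    copied over (Word Embedding Initialization); out-of-vocabulary
    words fall back to small random vectors. Returns the matrix and
    the number of words that were covered by the pretrained vectors.
    """
    rng = np.random.RandomState(seed)
    weights = rng.uniform(-0.1, 0.1, size=(len(vocab), dim))
    hits = 0
    for i, word in enumerate(vocab):
        if word in pretrained:
            weights[i] = pretrained[word]
            hits += 1
    return weights, hits
```

The resulting matrix replaces the random initialization of the recurrent text encoder's embedding layer, so training starts from semantically meaningful word vectors rather than noise; this is typically most helpful when the training corpus is large enough to fine-tune the vectors, consistent with the finding above that Word Embedding Initialization suits larger datasets.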
Keywords/Search Tags:Visual Semantic Embedding, Deep Learning, Word Embedding, Convolutional Neural Networks, Recurrent Neural Networks