Multi-label Classification Of Captioned Images Based On Deep Learning

Posted on: 2021-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: J T Cai
Full Text: PDF
GTID: 2428330623467818
Subject: Computer Science and Technology
Abstract/Summary:
The multi-label image classification task is to correctly recognize the various object types contained in natural images. Image classification is not only one of the foundations of computer vision but also has a wide range of real-world applications. However, multi-label images generally contain more targets, and they suffer from problems such as occlusion between targets, large differences in target size, and complex image content, so classifying them accurately is a challenging task.

Vision and language are the two core faculties humans use to solve real problems, so artificial intelligence research has devoted much work to both areas. In recent years, the great progress of deep learning in each field has broken down the boundary between vision and language, making cross-modal fusion a hot topic of current research. Many studies have shown that multi-modal models, which incorporate additional modalities, often perform better than single-modal algorithms. Within a multi-modal fusion framework, the key problem is how to let text information effectively assist the multi-label classification of images.

This thesis proposes a new short-video cover data set, which contains multi-label images together with the title text attached to each image. The data set is used to verify the validity of an image multi-label classification algorithm that combines a visual attention mechanism with multi-modal fusion. The main contributions are as follows:

1) This thesis improves the existing image attention mechanism and proposes a stereo attention mechanism. Most existing attention mechanisms focus on the spatial features of the feature map and ignore the information along the channel direction. By combining a spatial attention mechanism with a channel attention mechanism, the proposed method takes both spatial position and channel position into account. In particular, the spatial attention mechanism acts on the lower layers of the network, where the feature maps have higher resolution, and can therefore pay more attention to detailed information. In addition, the channel attention mechanism can be regarded as a selection of attributes. Experiments show that the mechanism is effective on the two data sets.

2) This thesis uses a hierarchical multi-label classification algorithm to establish the relationship between label subclasses and parent classes, helping the model obtain all the labels of an image. The algorithm optimizes local and global loss functions to discover local class relationships and global information from the entire class hierarchy, while penalizing hierarchical misclassifications. Analysis of the experimental results shows that the proposed algorithm can establish the implicit connections among the labels.

3) This thesis introduces title text information into image multi-label classification. Exploiting the hidden relationship between the text and the image, multi-modal fusion is performed by making the image features pay more attention to the regions that the text attends to, which helps classify the image. First, a self-attention mechanism focuses the text representation on the keywords in the sentence, and then the text features and image features are fused through a bilinear attention network. Subsequently, the fused output feature vector and the image feature vector are used together for multi-label classification, which guards against noise in the text. Finally, extensive experiments verify the effectiveness of the proposed method.
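The hierarchical loss described in contribution 2) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formulation, so everything below is an assumption made for illustration: the function names (`bce`, `hierarchy_violation`, `hierarchical_loss`), the averaging of local and global predictions, the squared-hinge form of the penalty, and the weight `lam` are all hypothetical. The sketch only shows the general idea of summing a local loss and a global loss and adding a penalty whenever a child label is predicted with higher probability than its parent.

```python
import math


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over per-label probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def hierarchy_violation(probs, parent_of):
    """Squared amount by which each child probability exceeds its parent's.

    parent_of maps a child label index to its parent label index
    (a hypothetical encoding of the class hierarchy).
    """
    return sum(
        max(0.0, probs[child] - probs[parent]) ** 2
        for child, parent in parent_of.items()
    )


def hierarchical_loss(local_probs, global_probs, targets, parent_of, lam=0.1):
    """Local loss + global loss + weighted hierarchy penalty (assumed form)."""
    local_loss = bce(local_probs, targets)
    global_loss = bce(global_probs, targets)
    # Assumption: the final prediction is the mean of the two heads.
    final = [(l + g) / 2.0 for l, g in zip(local_probs, global_probs)]
    return local_loss + global_loss + lam * hierarchy_violation(final, parent_of)


# Toy hierarchy: label 0 = "animal" (parent), 1 = "dog" (child), 2 = "vehicle".
parent_of = {1: 0}
targets = [1, 1, 0]
consistent = [0.9, 0.8, 0.1]    # child probability <= parent probability
inconsistent = [0.3, 0.8, 0.1]  # child predicted more strongly than parent

loss_ok = hierarchical_loss(consistent, consistent, targets, parent_of)
loss_bad = hierarchical_loss(inconsistent, inconsistent, targets, parent_of)
```

In this toy setup the consistent prediction incurs no hierarchy penalty, while the inconsistent one is penalized both by the cross-entropy terms and by the violation term, which is the behavior the abstract describes as "penalizing hierarchical misclassifications".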
Keywords/Search Tags: Deep Learning, Multi-label Image Classification, Attention Mechanism, Multi-modal Feature Fusion