
Research And Method Of Text-Image Summarization Based On Multimodal Neural Network

Posted on: 2022-06-22  Degree: Master  Type: Thesis
Country: China  Candidate: L He  Full Text: PDF
GTID: 2518306341453704  Subject: Computer Science and Technology
Abstract/Summary:
Most existing research on automatic summarization focuses on either the text modality or the image modality alone. With the rapid growth of multimedia data on the Internet, multimodal summarization has gradually attracted widespread attention. Prior experiments have shown that, compared with text-only summarization, multimodal summarization can improve the quality of generated summaries by exploiting image feature information from the visual modality, and that multimodal output significantly improves users' satisfaction with the summary. In recent years, researchers have begun to study multimodal news summarization that produces multimodal output, a task known as Multimodal Summarization with Multimodal Output (MSMO). Researchers from the Chinese Academy of Sciences have released a corresponding MSMO dataset. The strongest prior results are based on the pointer-generator network: by introducing image attention and multimodal attention mechanisms, applying data extension, and adding an image loss, they achieved the previous best performance on the MSMO dataset.

Unlike previous data extension methods that rely on a single rule, this thesis proposes a data extension method based on a statistical model that accounts for text-image relevance and image importance, effectively expanding the image annotation data in the MSMO training set. Experiments indicate that image position is an important feature for the image summarization task, which demonstrates the effectiveness of the data extension method.

This thesis also proposes a novel framework for the multimodal summarization with multimodal output task, built on the text-based Sequence-to-Sequence (Seq2seq) framework. It decouples the traditional Seq2seq architecture and connects the encoder and decoder with a multimodal interaction layer that learns the relevance between image and text information. The framework is highly flexible: it can inherit the structure and parameters of existing text Seq2seq models such as pretrained language models, and it supports different image encoders and decoding methods. The experiments use state-of-the-art generative pretrained language models and pretrained vision models, and the proposed approach achieves the best results on the text summary ROUGE metrics and the image accuracy metric.
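As a rough illustration of the kind of bridging module the abstract describes, the sketch below shows a multimodal interaction layer placed between a text Seq2seq encoder and its decoder, fusing image features into the encoder states via cross-attention. This is not the thesis's actual code; the module name, dimensions, and wiring are assumptions for illustration only.

```python
# Minimal sketch (assumed, not from the thesis): a cross-attention layer that
# connects a text encoder to a text decoder while injecting image features.
import torch
import torch.nn as nn

class MultimodalInteractionLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        # Text token states attend to image region features (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_states: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, text_len, d_model) from any pretrained text encoder
        # image_feats: (batch, n_regions, d_model) from any image encoder
        fused, _ = self.cross_attn(query=text_states, key=image_feats, value=image_feats)
        # Residual connection keeps the original text representation, so the
        # pretrained encoder and decoder weights can be reused unchanged.
        return self.norm(text_states + fused)

# Hypothetical usage: encoder -> interaction layer -> decoder
# enc_out = text_encoder(input_ids)                    # (B, T, d_model)
# fused = MultimodalInteractionLayer(d_model=768)(enc_out, image_feats)
# summary_ids = text_decoder(decoder_input_ids, encoder_hidden_states=fused)
```

The residual design is one plausible way to realize the flexibility the abstract claims: the layer can be inserted between any compatible text encoder and decoder without retraining them from scratch.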
Keywords/Search Tags: text-image summarization, multimodal embedding, dual-stream attention, deep learning, sequence-to-sequence model