
Extractive Automatic Text Summarization For Long Sequences

Posted on: 2024-01-17  Degree: Master  Type: Thesis
Country: China  Candidate: C L Han  Full Text: PDF
GTID: 2568307151967409  Subject: Computer technology

Abstract/Summary:
With the rapid development of information technology, people are exposed to an explosion of textual information. To obtain key information efficiently and accurately, summarizing the core content of large volumes of text has become an urgent need. Automatic text summarization is the task of automatically compressing long documents into relatively short texts while retaining most of the key information; it is widely used in scenarios such as news headline generation, book summaries, review selection, and extraction of key information from conversational text. This paper investigates the English long-sequence summarization task from two perspectives: ensuring the information integrity of long-sequence texts and learning the long-distance dependencies of documents.

First, because pre-trained language models impose a maximum input length, a common practice for handling long input sequences is to truncate them. This prevents the summarization model from accessing some of the labeled sentences, which inevitably leads to information loss. To address this problem, this paper proposes an extractive summarization framework based on a federated encoding approach that complements the missing information. The approach encodes sentences independently, which preserves the integrity of the long-sequence input, and then applies an information-fusion encoding step to resolve the information-silo problem that independent encoding creates. A separate module built from LSTM networks captures document-level contextual information, while the pre-trained language model focuses on local encoding of individual sentences; this avoids the complexity of modeling global information with a pre-trained language model while preserving the integrity of the input (a minimal sketch follows the abstract). Finally, the proposed method is compared experimentally on the Multi-News, PubMed (trunc), and CNN/DM datasets to verify its effectiveness and superiority.

Second, capturing the long-range dependencies in documents is especially important for long-sequence text summarization. However, current Transformer-based pre-trained language models are weak in this respect; for example, BERT models pre-trained on sentence pairs are not well suited to modeling document-level relationships. Therefore, this paper proposes an extractive summarization framework guided by topic models. The framework discovers the latent topics of a document with a topic model and mines the deep semantic links between topic information and document sentences with a heterogeneous graph neural network, so as to better capture the long-range dependencies between sentences (also sketched after the abstract). The framework additionally designs a memory-unit module and formulates the extraction process as a reinforcement-learning problem, alleviating topic repetition during topic selection and the discrepancy between training and testing of extractive models. Finally, the proposed framework is compared experimentally on the arXiv and GovReport datasets to verify its effectiveness and superiority.
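The first framework's federated-encoding idea can be illustrated with a short sketch. Everything here is an assumption for illustration, not the thesis's implementation: BERT-base stands in for the pre-trained sentence encoder, a single BiLSTM layer for the document-context module, and a linear layer for the sentence scorer; the class name JointEncodingExtractor and all hyperparameters are hypothetical.

```python
# Hypothetical sketch of the federated-encoding extractor: the pre-trained
# model encodes each sentence locally, a BiLSTM restores document context.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class JointEncodingExtractor(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", hidden=256):
        super().__init__()
        # Local encoding: the pre-trained LM sees one sentence at a time,
        # so the document is never truncated to the model's max length.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        d = self.encoder.config.hidden_size
        # Fusion encoding: a BiLSTM over sentence vectors recovers the
        # cross-sentence context lost by independent encoding.
        self.context = nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
        self.scorer = nn.Linear(2 * hidden, 1)

    def forward(self, input_ids, attention_mask):
        # input_ids: (num_sentences, seq_len) -- each row is one sentence.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        sent_vecs = out.last_hidden_state[:, 0]        # [CLS] vector per sentence
        ctx, _ = self.context(sent_vecs.unsqueeze(0))  # (1, S, 2*hidden)
        return self.scorer(ctx).squeeze(-1).squeeze(0) # extraction logit per sentence

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentences = ["First sentence of the document.", "Second sentence.", "A third one."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
logits = JointEncodingExtractor()(batch["input_ids"], batch["attention_mask"])
```

In training, logits like these would typically be fit against extractive oracle labels with a binary cross-entropy loss; at inference, the top-scoring sentences form the summary.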
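The second framework's topic-guided graph can be illustrated in the same spirit. Again, every choice here is a stand-in assumption: sklearn's LDA replaces the thesis's topic model, one step of weighted message passing over a bipartite sentence-topic graph replaces the heterogeneous graph neural network, and the memory unit and reinforcement-learning extractor are omitted entirely.

```python
# Hypothetical sketch: topic nodes bridge distant sentences, so two sentences
# that share a topic exchange information in two hops regardless of position.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The model encodes every sentence of the document.",
    "Long documents exceed the encoder input limit.",
    "Topic nodes connect sentences that share a theme.",
    "Graph message passing spreads topic information.",
    "The extractor selects salient sentences as the summary.",
    "Reinforcement learning reduces the train test gap.",
]

# 1. Discover latent topics: rows of theta are sentence-topic mixtures.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(sentences)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
theta = lda.fit_transform(X)                      # (num_sentences, num_topics)

# 2. Heterogeneous graph: sentence nodes <-> topic nodes, edges weighted
#    by theta; each topic node aggregates its connected sentences.
sent_feat = theta                                  # toy sentence features
topic_feat = theta.T @ sent_feat                   # (num_topics, num_topics)
topic_feat /= theta.sum(axis=0, keepdims=True).T   # normalize by edge mass

# 3. One message-passing step: each sentence pulls topic information back.
sent_updated = theta @ topic_feat                  # (num_sentences, num_topics)
scores = sent_updated.sum(axis=1)                  # toy salience score
print(np.argsort(-scores)[:2])                     # indices of top-2 sentences
```

In the full framework, node features would be contextual embeddings rather than raw topic mixtures, several attention-based message-passing layers would replace the single matrix product, and a policy network trained with a reinforcement-learning objective would select sentences step by step while the memory unit tracks already-covered topics.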
Keywords/Search Tags:Extractive summaries, long-sequence text, long-distance dependencies, topic models, heterogeneous graph neural networks, reinforcement learning