
Research On Text Summarization Technology With Controllable Length

Posted on: 2022-02-13 | Degree: Master | Type: Thesis
Country: China | Candidate: T C Xiao | Full Text: PDF
GTID: 2518306572950909 | Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the Internet information age, the relationship between people and information has changed significantly: as ever more information becomes available, it becomes ever easier to encounter new information. How to extract the central content from massive amounts of data has therefore become one of the keys to improving work efficiency and quality of life in the new era. Text summarization based on natural language processing is an important technique for solving this problem: a model trained on a large corpus takes a long text as input and outputs a summary that condenses the original. Building on this, controllable text generation has a wide range of application scenarios, and this thesis focuses on length-controllable abstractive summarization. Because screen sizes vary from device to device, the number of words that displays best also varies, so it is worth studying how a model can flexibly generate summaries of different lengths to fit different screens. And because extractive summaries are relatively rigid in form and generalize poorly, flexible abstractive summarization is the focus of this research.

This thesis studies three different approaches to length-controllable abstractive summarization. Experiments with the three methods were carried out on the NLPCC17 Chinese dataset and the CNNDM English dataset; length-controllable summaries were successfully generated, and the ROUGE scores were largely in line with expectations. Based on the final experimental results, we compare the advantages and disadvantages of the three methods and analyze the underlying reasons. On the NLPCC17 Chinese dataset we also manually annotated 500 examples with target summaries of different lengths for subsequent evaluation.

First, we modify the positional encoding of the Transformer model to control the length of the generated summary. Given the strengths of the Transformer in translation and other generation tasks, we modify its positional encoding to introduce the target summary length, so that the decoder incorporates the length information into its final representations. Compared with traditional length-control methods, the evaluation scores improve considerably. Because the NLPCC17 Chinese dataset is small, we also tried transfer learning, first training the model on the CLTS Plus dataset to improve its performance on the Chinese data. (A minimal sketch of one way to realize such a length-aware positional encoding is given below, after the description of the second method.)

Second, because modifying the positional encoding inside a large model introduces additional uncertainty, we tried to inject keyword information without changing the internal architecture of the pre-trained model. From the long source text we extract keywords matched to the target summary length to guide the content of the output. Based on the BART model, a specially defined length-interval marker and the extracted keywords are used as a prefix on the encoder side, steering the model to generate a summary in the corresponding length interval. We tried several keyword-extraction methods and found that the quality of keyword extraction has a large impact on the quality of the model's output.
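The sketch below illustrates the first method's idea of length-aware positional encoding. One published way to realize it is to encode, at each decoder step, the number of tokens remaining until the desired length rather than the absolute position (the length-difference positional encoding of Takase and Okazaki, 2019); the abstract does not state which exact formulation the thesis uses, so this PyTorch snippet is an illustrative assumption rather than the thesis's implementation.

```python
import math
import torch

def length_difference_positional_encoding(desired_len: int, max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal encoding of the remaining length (desired_len - position).

    The standard sinusoidal encoding of the absolute position i is replaced by
    an encoding of how many tokens are still allowed at step i, so the decoder
    can learn to wrap the summary up as the length budget runs out.
    """
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float32)
    remaining = (desired_len - position).unsqueeze(1)                    # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(remaining * div_term)
    pe[:, 1::2] = torch.cos(remaining * div_term)
    return pe  # added to the decoder token embeddings in place of the usual positional encoding

# Example: request a 30-token summary while allowing up to 60 decoding steps.
pe = length_difference_positional_encoding(desired_len=30, max_len=60, d_model=512)
print(pe.shape)  # torch.Size([60, 512])
```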
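For the second method, the following minimal sketch shows how a length-interval marker and extracted keywords could be prepended to the encoder input of a Hugging Face BART checkpoint. The marker names, the four length buckets, the example keywords, and the facebook/bart-large-cnn checkpoint are assumptions for illustration, not details taken from the thesis; the intended control behavior would only emerge after fine-tuning on data in which each source document is prefixed with the length bucket of its reference summary.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Register hypothetical length-interval markers as special tokens and resize the embeddings.
length_markers = ["<len_0>", "<len_1>", "<len_2>", "<len_3>"]
tokenizer.add_special_tokens({"additional_special_tokens": length_markers})
model.resize_token_embeddings(len(tokenizer))

def build_input(source: str, keywords: list, length_bucket: int) -> str:
    # Prefix = length-interval marker + extracted keywords, prepended to the source document.
    return f"{length_markers[length_bucket]} {' '.join(keywords)} </s> {source}"

source = "Long news article text ..."
keywords = ["summit", "climate", "agreement"]      # e.g. from TextRank or TF-IDF extraction
inputs = tokenizer(build_input(source, keywords, length_bucket=1),
                   return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, num_beams=4, max_length=80)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```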
Finally, motivated by the decoding-speed advantage of non-autoregressive models, we generate length-controllable summaries with a non-autoregressive model. Unlike the conventional non-autoregressive approach, which first predicts the length of the summary, this method follows the second one: keywords matched to the target summary length are extracted first and fed to the decoder as its input, and the summary is then generated non-autoregressively. Since non-autoregressive models are difficult to train to convergence, we train a teacher model to assist training. In addition, the decoding results of non-autoregressive models often contain consecutive repetitions, so we introduce a cosine-similarity loss over consecutive hidden representations, enlarging the differences between adjacent representations as much as possible to reduce repetition.
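To picture the repetition penalty of the third method, the sketch below shows one way a cosine-similarity loss over consecutive decoder hidden states could be written; the function name, the masking scheme, and the weighting coefficient are illustrative assumptions rather than the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def consecutive_similarity_loss(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average cosine similarity between each decoder state and its successor.

    hidden: (batch, seq_len, d_model) non-autoregressive decoder outputs
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    Minimizing this value pushes neighbouring hidden representations apart,
    which discourages the decoder from emitting the same token repeatedly.
    """
    sim = F.cosine_similarity(hidden[:, :-1, :], hidden[:, 1:, :], dim=-1)   # (batch, seq_len-1)
    pair_mask = (mask[:, :-1] * mask[:, 1:]).float()
    return (sim * pair_mask).sum() / pair_mask.sum().clamp(min=1.0)

# Usage sketch: add the penalty to the usual token-level loss with a tunable weight.
# total_loss = nll_loss + lambda_rep * consecutive_similarity_loss(decoder_hidden, pad_mask)
```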
Keywords/Search Tags: abstractive summarization, length controllable, position embedding, keyword extraction, non-autoregression