
A Research On Abstract Summary Extraction Of Long Texts Based On BERT Model

Posted on: 2022-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Ji
Full Text: PDF
GTID: 2518306524951559
Subject: Industrial Engineering

Abstract/Summary:
With the development of intelligent manufacturing in China, human-machine intelligent interaction has become one of the core tasks. This interaction involves reading and transmitting text data, which often contains large amounts of long-text information. To transfer and exchange key information efficiently, the important content must be filtered and condensed; the method used for this is abstractive summarization. However, existing text summarization research focuses mainly on short texts, and there is little work on long texts. The length of the text affects the quality of the summaries a network model generates: a long text carries more information, and expanding the length of the input text yields more complete semantic relations. Improving the accuracy of long-text summarization is therefore one of the research problems of the summary-extraction task.

This thesis improves and studies both the dataset and the pre-trained BERT model. The dataset work is based on the long-document corpus CNN/Daily Mail. The whole training process used all documents rather than selecting only the short ones, so that the corpus could be adapted to the new model. First, Doc2vec was used to train a document vector. Second, each long document was split, and the document vector was added to the embedding of every split sentence to preserve overall semantic relevance; this retains most of the semantic information in the original document and keeps the fragments associated after splitting. Finally, the sentence vectors were combined with positional encoding and an attention mechanism, and a position factor was added to improve the overall word-embedding vector.

For the model, the structure of BERT was adapted to the new data. Three BERT encoders were stacked; each encoder receives one part of the original long text together with the trained document vector and produces the corresponding output text vector, which is passed to a decoder. The summary content generated by each decoder is then combined into the overall summary representation. By splitting the input, the improved model reduces the dimension of each incoming segment while retaining the information relevant to the whole long text, so the long-text summarization task is completed without increasing computational cost.

Finally, this thesis proposes an algorithm for abstractive summarization of long documents based on the improved BERT model. The improved model was evaluated on two data forms: a segmented dataset and a combined dataset. The comparison test results show that the accuracy of the summaries generated by the proposed architecture improves on different evaluation indicators, verifying the effectiveness of the proposed method and model for the abstractive summarization task when the input text is long.
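The dataset-preparation steps above (train a Doc2vec document vector, split the document, add that vector to each fragment's embedding) can be sketched as follows. This is a minimal illustration, not the thesis's actual pipeline: the toy corpus, the 16-dimensional vector size, and the random stand-in chunk embeddings are all assumptions made for the sketch.

```python
# Sketch: train a Doc2vec document vector with gensim, then add it to the
# embeddings of the document's split chunks so every fragment keeps the
# global semantics of the original document.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

docs = [
    "the long article text goes here split into words",
    "another training document with its own content",
]
tagged = [TaggedDocument(words=d.split(), tags=[str(i)]) for i, d in enumerate(docs)]

# Tiny dimensions and epoch count, purely for illustration.
model = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=20)

doc_vec = model.dv["0"]  # document-level vector for the first document

# Stand-in chunk embeddings (in the thesis these would be BERT sentence
# embeddings of the split fragments); random vectors of matching size here.
chunk_embeddings = np.random.randn(3, 16)

# Add the document vector to every chunk embedding, keeping the fragments
# associated with the whole document after splitting.
augmented = chunk_embeddings + doc_vec
```

Adding (rather than concatenating) the document vector keeps the embedding dimension unchanged, which is what lets the augmented fragments feed into an unmodified encoder input layer.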
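The three-stacked-encoder idea can be sketched in the same spirit. The encoder below is a small PyTorch stand-in rather than a pretrained BERT, and every size is an illustrative assumption (an 8-token chunk limit in place of BERT's 512, a 16-dimensional model, a zero document vector in place of a trained one).

```python
# Sketch: split a long token sequence into three pieces (one per stacked
# encoder), run each piece through its own encoder with the document vector
# injected, and combine the per-chunk representations.
import torch
import torch.nn as nn

MAX_LEN = 8      # stands in for BERT's 512-token input limit
NUM_CHUNKS = 3   # one chunk per stacked encoder
DIM = 16

def split_into_chunks(token_ids, num_chunks, max_len):
    """Split a long sequence into num_chunks pieces, truncating each to max_len."""
    size = (len(token_ids) + num_chunks - 1) // num_chunks
    return [token_ids[i * size:(i + 1) * size][:max_len] for i in range(num_chunks)]

class ChunkEncoder(nn.Module):
    """Tiny stand-in for one BERT encoder in the stack."""
    def __init__(self, vocab=100, dim=DIM):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, ids, doc_vec):
        x = self.emb(ids) + doc_vec      # inject the document vector
        return self.enc(x).mean(dim=1)   # one pooled vector per chunk

tokens = list(range(20))                 # a "long document" of 20 token ids
chunks = split_into_chunks(tokens, NUM_CHUNKS, MAX_LEN)

encoders = [ChunkEncoder() for _ in range(NUM_CHUNKS)]
doc_vec = torch.zeros(DIM)               # would come from Doc2vec training

with torch.no_grad():
    outs = [enc(torch.tensor([c]), doc_vec) for enc, c in zip(encoders, chunks)]
reps = torch.cat(outs, dim=0)            # combined representation, shape (3, DIM)
```

Because each encoder only ever sees one chunk, no single forward pass exceeds the per-encoder length limit, which is the sense in which the split keeps the long-text task tractable without extra cost per encoder.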
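The abstract does not name the evaluation indicators; ROUGE is the standard metric family for CNN/Daily Mail summarization, so a minimal ROUGE-1 F1 computation is sketched here under that assumption.

```python
# Sketch: ROUGE-1 F1 as unigram-overlap precision/recall between a
# candidate summary and a reference summary.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the model generates a summary",
                  "the model writes a summary")  # → 0.8
```

Published results typically report ROUGE-1, ROUGE-2, and ROUGE-L together, usually via a dedicated package rather than a hand-rolled function like this one.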
Keywords/Search Tags: Abstractive Summarization, Natural Language Processing, BERT, Long Document, Neural Network