
A Research On Abstract Summary Extraction Of Long Texts Based On BERT Model

Posted on: 2022-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Ji
Full Text: PDF
GTID: 2518306524951559
Subject: Industrial Engineering

Abstract/Summary:
With the development of intelligent manufacturing in China, human-machine intelligent interaction has become one of the core tasks. This interaction involves reading and transmitting text data, which often contains large amounts of long-text information. To transfer and exchange key information efficiently, the important content must be filtered and condensed; the method used for this is abstractive summarization. However, existing text summarization research focuses mainly on short texts, and there is little work on long texts. The length of the text affects the quality of the summaries a network model generates: a long text carries more information, and expanding the length of the input text yields more complete semantic relations. Improving the accuracy of long-text summarization is therefore one of the research problems of the summary-extraction task.

This thesis improves and studies both the dataset and the pre-trained BERT model. The dataset work is based on the long-document corpus CNN/Daily Mail. The whole training process used all documents rather than selecting only the short ones, so that the corpus could be adapted to the new model. First, Doc2vec was used to train a document vector. Second, each long document was split, and the document vector was added to the embedding of every split sentence to preserve overall semantic relevance; this retains most of the semantic information in the original document and keeps the fragments associated after splitting. Finally, the sentence vectors were combined with positional encoding and an attention mechanism, and a position factor was added to improve the overall word-embedding vector.

For the model, the structure of BERT was adapted to the new data. Three BERT encoders were stacked; each encoder receives one part of the original long text together with the trained document vector and produces the corresponding output text vector, which is passed to a decoder. The summary content generated by each decoder is then combined into the overall summary representation. By splitting the input, the improved model reduces the dimension of each incoming segment while retaining the information relevant to the whole long text, so the long-text summarization task is completed without increasing computational cost.

Finally, this thesis proposes an algorithm for abstractive summarization of long documents based on the improved BERT model. The improved model was evaluated on two data forms: a segmented dataset and a combined dataset. The comparison test results show that the accuracy of the summaries generated by the proposed architecture improves on different evaluation indicators, verifying the effectiveness of the proposed method and model for the abstractive summarization task when the input text is long.
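The dataset-preparation steps above (train a Doc2vec document vector, split the document, add that vector to each fragment's embedding) can be sketched as follows. This is a minimal illustration, not the thesis's actual pipeline: the toy corpus, the 16-dimensional vector size, and the random stand-in chunk embeddings are all assumptions made for the sketch.

```python
# Sketch: train a Doc2vec document vector with gensim, then add it to the
# embeddings of the document's split chunks so every fragment keeps the
# global semantics of the original document.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

docs = [
    "the long article text goes here split into words",
    "another training document with its own content",
]
tagged = [TaggedDocument(words=d.split(), tags=[str(i)]) for i, d in enumerate(docs)]

# Tiny dimensions and epoch count, purely for illustration.
model = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=20)

doc_vec = model.dv["0"]  # document-level vector for the first document

# Stand-in chunk embeddings (in the thesis these would be BERT sentence
# embeddings of the split fragments); random vectors of matching size here.
chunk_embeddings = np.random.randn(3, 16)

# Add the document vector to every chunk embedding, keeping the fragments
# associated with the whole document after splitting.
augmented = chunk_embeddings + doc_vec
```

Adding (rather than concatenating) the document vector keeps the embedding dimension unchanged, which is what lets the augmented fragments feed into an unmodified encoder input layer.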
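The three-stacked-encoder idea can be sketched in the same spirit. The encoder below is a small PyTorch stand-in rather than a pretrained BERT, and every size is an illustrative assumption (an 8-token chunk limit in place of BERT's 512, a 16-dimensional model, a zero document vector in place of a trained one).

```python
# Sketch: split a long token sequence into three pieces (one per stacked
# encoder), run each piece through its own encoder with the document vector
# injected, and combine the per-chunk representations.
import torch
import torch.nn as nn

MAX_LEN = 8      # stands in for BERT's 512-token input limit
NUM_CHUNKS = 3   # one chunk per stacked encoder
DIM = 16

def split_into_chunks(token_ids, num_chunks, max_len):
    """Split a long sequence into num_chunks pieces, truncating each to max_len."""
    size = (len(token_ids) + num_chunks - 1) // num_chunks
    return [token_ids[i * size:(i + 1) * size][:max_len] for i in range(num_chunks)]

class ChunkEncoder(nn.Module):
    """Tiny stand-in for one BERT encoder in the stack."""
    def __init__(self, vocab=100, dim=DIM):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, ids, doc_vec):
        x = self.emb(ids) + doc_vec      # inject the document vector
        return self.enc(x).mean(dim=1)   # one pooled vector per chunk

tokens = list(range(20))                 # a "long document" of 20 token ids
chunks = split_into_chunks(tokens, NUM_CHUNKS, MAX_LEN)

encoders = [ChunkEncoder() for _ in range(NUM_CHUNKS)]
doc_vec = torch.zeros(DIM)               # would come from Doc2vec training

with torch.no_grad():
    outs = [enc(torch.tensor([c]), doc_vec) for enc, c in zip(encoders, chunks)]
reps = torch.cat(outs, dim=0)            # combined representation, shape (3, DIM)
```

Because each encoder only ever sees one chunk, no single forward pass exceeds the per-encoder length limit, which is the sense in which the split keeps the long-text task tractable without extra cost per encoder.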
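The abstract does not name the evaluation indicators; ROUGE is the standard metric family for CNN/Daily Mail summarization, so a minimal ROUGE-1 F1 computation is sketched here under that assumption.

```python
# Sketch: ROUGE-1 F1 as unigram-overlap precision/recall between a
# candidate summary and a reference summary.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the model generates a summary",
                  "the model writes a summary")  # → 0.8
```

Published results typically report ROUGE-1, ROUGE-2, and ROUGE-L together, usually via a dedicated package rather than a hand-rolled function like this one.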
Keywords/Search Tags: Abstractive Summarization, Natural Language Processing, BERT, Long Document, Neural Network