
Research On News Text Summarization Algorithm Based On Pre-trained Language Model

Posted on: 2022-12-04    Degree: Master    Type: Thesis
Country: China    Candidate: Z P Feng    Full Text: PDF
GTID: 2518306779996409    Subject: Automation Technology
Abstract/Summary:
With the rapid development of the Internet, data of all kinds is growing explosively. News text in particular has permeated public life through push notifications, streaming platforms, and social media, and people are eager for effective ways to alleviate information overload. In recent years, deep learning has greatly advanced research on automatic text summarization, but existing methods still have shortcomings: summarization algorithms for Chinese corpora do not make full use of Chinese word-segmentation features, and their training objectives neglect overall semantic relevance. The result is poorly coherent summaries, which ultimately hurts the accuracy and readability of the generated text. Addressing these shortcomings, the main research contents of this thesis are as follows.

The automatic text summarization algorithm proposed in this thesis adopts an Encoder-Decoder architecture. The encoder is the pre-trained language model BERT-wwm-ext, which captures overall semantic information well and effectively handles long-range context dependence in text. The decoder consists of multiple stacked Transformer decoding units; because both encoder and decoder are Transformer-based, the model supports parallel computation, which further improves training efficiency. A semantic evaluation module is added between the encoder and decoder so that semantic relevance is taken into account during training. Training starts from the pre-trained language model and fine-tunes it with abstractive text summarization as the downstream task.

To make full use of the word-segmentation features of the Chinese news corpus, this thesis proposes adding a word-segmentation feature embedding to the pre-trained language model, so that the model learns the necessary segmentation features of the source text during training and provides more informative semantic vectors for the decoding process. To match the original input structure of the pre-trained language model, an encoding-alignment algorithm is designed to generate the corresponding word-segmentation embedding codes.

To further improve the accuracy and readability of summaries and to alleviate semantic drift in the generated content, this thesis proposes a mixed-objective training method with semantic evaluation. The cosine similarity between the semantic vector of the reference summary and that of the generated summary is computed and used to correct the model's learning direction, so that the algorithm attends to the overall semantics of the content while pursuing literal closeness to the reference.

Experiments are conducted on LCSTS, a large-scale Chinese short-text summarization dataset. The results show that the algorithm achieves better summary quality with less training time and less training data, reflecting its strong generalization and efficiency. After incorporating the word-segmentation feature embedding, the proportion of phrases and the phrase accuracy in the generated summaries increase by 2.34% and 1.69%, respectively. The mixed-objective training method with semantic-relevance evaluation effectively alleviates the problem of semantic drift. Compared with similar summarization algorithms, the proposed algorithm achieves the best results on the ROUGE-2 and ROUGE-L metrics, demonstrating that it effectively improves the accuracy, readability, and overall quality of news text summaries.
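The encoding-alignment idea described above can be sketched as follows. The BMES tagging scheme, the id values, and the function name are illustrative assumptions, not the thesis's actual implementation; the point is only that one segmentation-feature id is emitted per character so the sequence lines up with BERT's character-level input.

```python
# Sketch of encoding alignment: map a word-segmented Chinese sentence to
# char-level segmentation-feature ids (BMES: Begin/Middle/End/Single).
# The scheme and id assignment are assumptions for illustration.

SEG_IDS = {"B": 0, "M": 1, "E": 2, "S": 3}

def align_segmentation(words):
    """Return one BMES feature id per character of the segmented sentence."""
    ids = []
    for word in words:
        if len(word) == 1:
            ids.append(SEG_IDS["S"])          # single-character word
        else:
            ids.append(SEG_IDS["B"])          # first character
            ids.extend(SEG_IDS["M"] for _ in word[1:-1])  # middle characters
            ids.append(SEG_IDS["E"])          # last character
    return ids

# "今天 天气 不错" (three two-character words) → B E B E B E
print(align_segmentation(["今天", "天气", "不错"]))  # [0, 2, 0, 2, 0, 2]
```

These ids would then index an extra embedding table whose output is summed with BERT's token, position, and segment embeddings, which is one common way to inject such features without changing the model's input length.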
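The mixed-objective training method might be blended as sketched below: a token-level cross-entropy term (literal closeness) combined with the cosine distance between the semantic vectors of the reference and generated summaries. The weighting scheme and the `alpha` hyperparameter are assumptions; the thesis does not specify them here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two semantic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mixed_objective(ce_loss, ref_vec, gen_vec, alpha=0.5):
    """Blend cross-entropy with a semantic-distance correction term.

    ce_loss : cross-entropy of the generated summary (literal closeness)
    ref_vec : semantic vector of the reference summary
    gen_vec : semantic vector of the generated summary
    alpha   : assumed interpolation weight between the two terms
    """
    semantic_distance = 1.0 - cosine_similarity(ref_vec, gen_vec)
    return alpha * ce_loss + (1.0 - alpha) * semantic_distance
```

When the generated summary's semantic vector matches the reference exactly, the distance term vanishes and the loss reduces to the weighted cross-entropy alone; as the vectors diverge, the extra term pushes gradients toward semantically faithful output.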
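The ROUGE-2 metric used in the evaluation measures bigram overlap between the generated and reference summaries. A simplified recall-only sketch over character sequences (single reference, clipped counts, no stemming) looks like this; full ROUGE implementations also report precision and F-score:

```python
from collections import Counter

def rouge_2_recall(reference, candidate):
    """Simplified ROUGE-2 recall: clipped bigram overlap / reference bigrams."""
    def bigrams(seq):
        return [tuple(seq[i:i + 2]) for i in range(len(seq) - 1)]

    ref_counts = Counter(bigrams(reference))
    cand_counts = Counter(bigrams(candidate))
    if not ref_counts:
        return 0.0
    # Clip each candidate bigram's count by its count in the reference.
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return overlap / sum(ref_counts.values())
```

For Chinese summaries the sequence is typically the raw character string, which is why character-level bigrams are used here.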
Keywords/Search Tags: pre-trained language models, news text summarization, word segmentation feature embedding, mixed-objective training