
Research On Text Keyphrase Generation Method Based On Pre-trained Language Model

Posted on: 2022-12-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Wang
Full Text: PDF
GTID: 2518306746481324
Subject: Automation Technology
Abstract/Summary:
Keyphrase generation methods produce keyphrases that represent the subject and main meaning of a text or document. Most current keyphrase generation methods use a recurrent network structure, which struggles with long-distance dependencies in text and whose sequential nature precludes parallelization over training samples. They also suffer from inaccurate word embedding representations, poor generalization, and high training cost. These problems limit further improvement in keyphrase generation performance. To address them, this thesis makes the following contributions:

(1) To address the long-distance dependency limitation and inaccurate word embedding representations, a keyphrase generation model based on XLNet, Score XLNet, is proposed. It is an encoder-decoder framework that leverages the rich semantic features of XLNet, trained on massive data, to improve performance on keyphrase generation tasks. The model first uses XLNet to extract important sentences, and then uses the title to guide the pre-trained encoder in collecting information about each word in those sentences. In addition, a character-level reinforcement learning reward mechanism based on phrase prefix matching is introduced to alleviate the inconsistency between training and testing. Experiments on five public datasets show that the method effectively alleviates these problems and improves model performance.

(2) To address keyphrase generation under high training cost and in low-resource scenarios, a keyphrase generation model, Score XLNet-GAN, is proposed, based on a pre-trained language model and a generative adversarial network structure. Both the generator and the discriminator of this model are pre-trained language models. The generator produces a series of keyphrases for the input document, and the discriminator tries to distinguish machine-generated keyphrases from manually labeled ones. A new discriminator structure for sequence classification, based on an improved BERT pre-trained language model, is also proposed. The model performs well even when trained on only 1% of the standard training dataset.
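The abstract does not give the formula for the character-level reward based on phrase prefix matching mentioned in (1). As a rough illustration only, the sketch below assumes the reward is the length of the longest shared character prefix between a generated phrase and any reference phrase, normalized by the longer of the two. The function names and the normalization are hypothetical and are not taken from the thesis.

```python
from typing import List


def prefix_match_len(pred: str, gold: str) -> int:
    """Count how many leading characters of pred and gold are identical."""
    n = 0
    for a, b in zip(pred, gold):
        if a != b:
            break
        n += 1
    return n


def prefix_reward(pred: str, gold_phrases: List[str]) -> float:
    """Hypothetical reward in [0, 1]: best normalized prefix match
    between the generated phrase and any reference keyphrase.
    The normalization by the longer phrase is an assumption."""
    if not pred or not gold_phrases:
        return 0.0
    best = 0.0
    for gold in gold_phrases:
        if not gold:
            continue
        matched = prefix_match_len(pred, gold)
        best = max(best, matched / max(len(pred), len(gold)))
    return best


if __name__ == "__main__":
    refs = ["neural network", "keyphrase generation"]
    for candidate in ["neural network", "neural net", "graph model"]:
        print(candidate, prefix_reward(candidate, refs))
```

Under this assumption a partially correct phrase still earns a graded reward during decoding, rather than the all-or-nothing signal of exact match, which is one plausible way such a reward could reduce the gap between training and testing.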
Keywords/Search Tags: Keyphrase generation, Sentence extraction, Pre-trained language model, GAN, RL