
Research On Dialog Generation Methods Based On Proximal Policy Optimization And Adversarial Learning

Posted on: 2021-12-10 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Cai | Full Text: PDF
GTID: 2518306200953489 | Subject: Computer technology
Abstract/Summary:
Dialogue generation is a key research direction in natural language processing, and it has developed rapidly with the rise of deep learning. Several challenges remain, however. First, a generated reply may be ungrammatical, violating the structure of human language; second, models tend to produce bland, generic replies; finally, a reply may lack contextual relevance, so the response does not fit the conversation.

This thesis proposes PPO-GAN, a dialogue generation method based on proximal policy optimization (PPO). The method adopts the framework of generative adversarial nets (GAN), in which a generative model produces dialogues and a discriminative model distinguishes generated dialogues from real ones. PPO is used to train the GAN, which handles the fact that sampling discrete tokens during dialogue generation is non-differentiable and therefore blocks ordinary backpropagation through the generator. By limiting the gradient step of each generator iteration, PPO keeps the generator's training monotonically non-decreasing while allowing the rewards produced by the discriminative model to be reused across several gradient steps.

The main work of this thesis is:
1) training a sequence-to-sequence model with an attention mechanism as the generative model of the GAN, used to generate dialogue;
2) training a hierarchical neural network as the discriminative model of the GAN, used to distinguish real conversations from conversations produced by the generative model;
3) iteratively training the GAN with proximal policy optimization, using Monte Carlo rollouts during adversarial training to compute a reward for each generated word.

The innovation of this thesis lies in the improved training method: the GAN architecture is combined with recent progress in reinforcement learning, yielding the proposed dialogue generation method PPO-GAN. Compared with maximum likelihood estimation, the classic training algorithm for open-domain dialogue generation, PPO-GAN has a discriminative model that better guides the training of dialogue generation. Compared with the recently proposed adversarial training method Adver-REGS, PPO-GAN obtains both the direction and the step size of the generator's parameter updates by optimizing a surrogate objective with a penalty term, which keeps the generator's training monotonically non-decreasing and improves the reuse of the rewards returned by the discriminative model.

This thesis trains on a dataset commonly used for open-domain dialogue generation models. Training efficiency is evaluated through the convergence rate of the training loss, and the quality of the generated dialogue is evaluated through perplexity, the frequency of bland replies, and examples of generated dialogue. Compared with the maximum likelihood estimation algorithm and the Adver-REGS algorithm for open-domain dialogue generation training, PPO-GAN improves both the efficiency of dialogue training and the quality of dialogue generation.
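The training procedure described above can be made concrete with a short sketch. The following is a minimal, illustrative PyTorch sketch, not the thesis code: the toy Generator stands in for the attention-based sequence-to-sequence generator, the toy Discriminator for the hierarchical discriminative model, and all names, sizes, and hyperparameters (VOCAB, MAX_LEN, beta, the number of rollouts) are assumptions for illustration. It shows the two ingredients the abstract names: per-word rewards from Monte Carlo rollouts scored by the discriminator, and a PPO update on a penalized surrogate objective whose cached rewards are reused across several gradient epochs.

```python
# Illustrative sketch only; component names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, HID, MAX_LEN = 1000, 32, 64, 20   # toy sizes; BOS token id = 0

class Generator(nn.Module):
    """Toy autoregressive policy standing in for the seq2seq generator."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def logprob(self, seqs):
        """Log-probability of each token of `seqs` under the current policy."""
        bos = torch.zeros(seqs.size(0), 1, dtype=torch.long)
        x, _ = self.rnn(self.emb(torch.cat([bos, seqs[:, :-1]], dim=1)))
        logp = F.log_softmax(self.out(x), dim=-1)
        return logp.gather(2, seqs.unsqueeze(2)).squeeze(2)        # (B, T)

    def sample_from(self, prefix, steps):
        """Sample `steps` continuation tokens after a given prefix."""
        bos = torch.zeros(prefix.size(0), 1, dtype=torch.long)
        x, h = self.rnn(self.emb(torch.cat([bos, prefix], dim=1)))
        logits, toks = self.out(x[:, -1:]), []
        for _ in range(steps):
            tok = torch.distributions.Categorical(
                logits=logits.squeeze(1)).sample().unsqueeze(1)
            toks.append(tok)
            x, h = self.rnn(self.emb(tok), h)
            logits = self.out(x)
        return torch.cat(toks, dim=1)

    def sample(self, batch):
        seqs = self.sample_from(torch.zeros(batch, 0, dtype=torch.long), MAX_LEN)
        return seqs, self.logprob(seqs)

class Discriminator(nn.Module):
    """Toy classifier standing in for the hierarchical discriminative model."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, 1)

    def forward(self, seq):                  # probability the reply is real
        _, h = self.rnn(self.emb(seq))
        return torch.sigmoid(self.out(h[-1]))

def mc_rewards(gen, disc, seqs, rollouts=4):
    """Per-word reward: complete each prefix with Monte Carlo rollouts and
    average the discriminator's scores of the completed sequences."""
    B, T = seqs.shape
    R = torch.zeros(B, T)
    with torch.no_grad():
        for t in range(1, T + 1):
            if t == T:
                R[:, t - 1] = disc(seqs).squeeze(1)
                continue
            for _ in range(rollouts):
                suffix = gen.sample_from(seqs[:, :t], T - t)
                full = torch.cat([seqs[:, :t], suffix], dim=1)
                R[:, t - 1] += disc(full).squeeze(1) / rollouts
    return R

def ppo_update(gen, opt, seqs, old_logps, rewards, beta=0.01, epochs=3):
    """PPO step on a penalized surrogate objective; the cached rewards are
    reused for several gradient epochs instead of being consumed once."""
    adv = rewards - rewards.mean()           # crude baseline for the advantage
    old_logps = old_logps.detach()
    for _ in range(epochs):
        new_logps = gen.logprob(seqs)
        ratio = torch.exp(new_logps - old_logps)
        kl = (old_logps - new_logps).mean()  # rough KL(old || new) penalty term
        loss = -(ratio * adv).mean() + beta * kl
        opt.zero_grad()
        loss.backward()
        opt.step()

gen, disc = Generator(), Discriminator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
seqs, old_logps = gen.sample(8)              # generator acts as the policy
rewards = mc_rewards(gen, disc, seqs)        # per-word rewards from D
ppo_update(gen, opt, seqs, old_logps, rewards)
```

In a full implementation, the discriminative model would itself be updated between generator steps, e.g. with a binary cross-entropy loss on real versus generated replies; that half of the adversarial loop is omitted here for brevity.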
Keywords/Search Tags: dialog generation, proximal policy optimization (PPO), reinforcement learning, generative adversarial nets (GAN), sequence-to-sequence model