A dialogue system is an intelligent agent that interacts with humans in natural language. Among dialogue systems, a task-oriented dialogue system aims to help users complete specific tasks, and its actions are determined by a dialogue policy. With the continuing development of artificial intelligence, reinforcement learning has been widely used to train task-oriented dialogue systems. Reinforcement learning acquires the dialogue policy by interacting with users and can dynamically adjust the policy according to user feedback, so it can adapt to different dialogue tasks. In practice, however, interacting with real users is prohibitively expensive, so user simulators are generally used instead. Because user simulators lack the linguistic complexity of real users, biases in their design can make dialogue policy learning inefficient and unstable. Improving the efficiency and quality of policy learning from limited dialogue experience is therefore crucial. This thesis mainly studies dialogue policy learning and dialogue response generation in task-oriented dialogue systems. The research covers the following three aspects.

(1) An adversarial curriculum method based on automatically generated task objectives. To address the problem that the scarcity of successful dialogue samples limits learning efficiency in reinforcement learning, this thesis proposes an Adversarial Curriculum Method for Dialogue Policy Learning (ACM-DPL) based on automatically generated task objectives. ACM-DPL is a curriculum learning framework built on generative adversarial networks and consists of two modules: a teacher module and a student module. The teacher module uses a generator to produce task objectives, i.e., the problems that users need solved, and provides them to the student module for dialogue policy learning. The student module learns the dialogue policy from interaction experience and continually improves its interaction skills. Compared with randomly sampling user objectives from a dataset, ACM-DPL presents tasks to the dialogue agent in a more effective order, gradually increasing the difficulty of the task objectives during training and thereby improving the agent's learning performance. Moreover, the objectives produced by the generator are only slightly harder than those the student network has already mastered, which helps the student network obtain more successful dialogue samples and improves the learning efficiency of the dialogue agent.

(2) Dialogue policy learning based on a decision sequence model. To address the high cost of training in realistic environments, this thesis proposes Dialogue Policy Learning based on a Decision Sequence Model (DSM-DPL). DSM-DPL combines offline reinforcement learning with sequence modeling: it casts dialogue as a sequence problem and trains a Transformer model with sequence modeling objectives on dialogue experience collected offline. Specifically, DSM-DPL uses a causally masked Transformer to generate dialogue actions; it conditions the autoregressive model on the expected return, past states, and actions, and outputs the best action. DSM-DPL derives the most effective behavior from a fixed, limited pool of experience, avoiding the substantial cost of repeatedly collecting dialogue data from the environment. In addition, its way of generating dialogue actions makes dialogues more diverse and the dialogue system more robust.
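The action-generation mechanism described in (2) follows a return-conditioned sequence modeling recipe. The following is a minimal sketch of that idea, assuming a PyTorch implementation; the class name `DecisionSequenceModel`, the dimensions, and the toy rollout at the bottom are illustrative assumptions, not the thesis code.

```python
# Illustrative sketch of a causally masked Transformer that conditions on
# (return-to-go, state, action) tokens and predicts the next dialogue action.
# All names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class DecisionSequenceModel(nn.Module):
    def __init__(self, state_dim, n_actions, d_model=128, n_layers=3, n_heads=4):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)             # expected remaining reward
        self.embed_state = nn.Linear(state_dim, d_model)   # dialogue state features
        self.embed_action = nn.Embedding(n_actions, d_model)
        self.embed_pos = nn.Embedding(1024, d_model)       # dialogue turn index
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T), timesteps: (B, T)
        B, T = actions.shape
        pos = self.embed_pos(timesteps)
        tok_r = self.embed_rtg(rtg) + pos
        tok_s = self.embed_state(states) + pos
        tok_a = self.embed_action(actions) + pos
        # Interleave tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...): shape (B, 3T, d_model).
        seq = torch.stack([tok_r, tok_s, tok_a], dim=2).reshape(B, 3 * T, -1)
        L = seq.size(1)
        # Causal mask so each position only attends to earlier tokens.
        causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.transformer(seq, mask=causal_mask)
        # Predict the next dialogue action from each state-token position.
        return self.action_head(h[:, 1::3, :])


if __name__ == "__main__":
    model = DecisionSequenceModel(state_dim=20, n_actions=10)
    B, T = 2, 5
    logits = model(
        rtg=torch.randn(B, T, 1),
        states=torch.randn(B, T, 20),
        actions=torch.randint(0, 10, (B, T)),
        timesteps=torch.arange(T).repeat(B, 1),
    )
    print(logits.shape)  # torch.Size([2, 5, 10]): action logits per dialogue turn
```

At inference time, such a model is typically prompted with a desired return and the dialogue history, and the highest-scoring action logit is taken as the system action for the current turn.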
(3) Dialogue policy learning based on user model discrepancy. To address the stability issue of dialogue systems, this thesis proposes a dialogue policy learning method based on user model discrepancy (MDQ). In reinforcement learning, agents are commonly trained in multiple dialogue environments: the dialogue agent interacts with a set of diverse user models learned from user simulators. However, the quality of these user models can vary greatly, and errors learned by a user model may lead to a dialogue policy that does not match real-world dialogue. MDQ formulates an optimization problem that jointly optimizes the dialogue policy and the sampling distribution over user models. It introduces a user model evaluator to assess the quality of each user model and selects high-quality user models for training, producing a dialogue agent that better matches human dialogue habits.
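To make the coupling between the evaluator and the sampling distribution in (3) concrete, here is a minimal sketch of one plausible scheme: evaluator scores are turned into sampling probabilities so that higher-quality user models are chosen more often. The softmax weighting, the function names, and the example scores are assumptions for illustration, not the MDQ formulation from the thesis.

```python
# Illustrative sketch: weight the choice of training user model by an
# evaluator's quality score. All values and names are hypothetical.
import math
import random


def sampling_distribution(quality_scores, temperature=1.0):
    """Turn evaluator scores for each user model into sampling probabilities."""
    exps = [math.exp(s / temperature) for s in quality_scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_user_model(user_models, quality_scores):
    """Sample a user model, favouring those the evaluator rates as high quality."""
    probs = sampling_distribution(quality_scores)
    return random.choices(user_models, weights=probs, k=1)[0]


if __name__ == "__main__":
    user_models = ["user_model_a", "user_model_b", "user_model_c"]
    # Hypothetical evaluator outputs; higher means closer to real user behaviour.
    quality_scores = [0.9, 0.2, 0.6]
    counts = {m: 0 for m in user_models}
    for _ in range(10_000):
        counts[pick_user_model(user_models, quality_scores)] += 1
    print(counts)  # user_model_a is selected most often for policy training
```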