Dialogue systems, as the core gateway for future human-computer interaction, provide a natural-language interface between humans and computers. Among them, task-oriented dialogue systems, which aim to help users perform tasks in one or more domains, have been widely used in a variety of service scenarios and have gradually developed into a gateway for future application services. The dialogue policy determines the system response and controls the dialogue process, and is a key module of a task-oriented dialogue system. Deep reinforcement learning provides good theoretical support for modeling dialogue policies and is the mainstream technique at present. However, due to the flexibility of natural language, human-computer dialogue scenarios usually have larger state and action spaces than classical reinforcement learning problems, so dialogue policy learning based on deep reinforcement learning usually suffers from low learning efficiency. Therefore, exploring and constructing an efficient dialogue policy model is of great theoretical value and guiding significance for realizing highly available task-oriented dialogue systems. In this thesis, we focus on task-oriented dialogue policies, analyze and address the above core technical problem, and conduct a series of studies from the perspectives of training and modeling. The main work and contributions of this thesis are as follows:

(1) From the training aspect, this thesis proposes a novel Automatic Curriculum Learning-based Deep Q-Network (ACL-DQN) that improves the learning efficiency of dialogue policies by incorporating curriculum learning. ACL-DQN replaces the traditional random sampling approach with an RL-based teacher model that customizes a reasonable curriculum for the dialogue policy (referred to as the student model), enabling automatic curriculum policy learning for the first time. Experiments show that ACL-DQN improves effectiveness and stability on dialogue tasks by a statistically
significant margin. Furthermore, the framework can be further improved by equipping it with different curriculum schedules, which demonstrates its strong generalizability.

(2) To address the lack of reliable task difficulty evaluation and the high cost of curriculum sequencing in existing curriculum policy learning methods, this thesis presents a novel versatile adaptive curriculum learning (VACL) framework. The framework defines an evaluation method that requires only the learning experience of the dialogue policy to assess the difficulty of dialogue tasks accurately. To reduce the cost of curriculum sequencing, the framework further explores its versatility: a generic, resilient global curriculum can be constructed while a well-performing dialogue policy is being trained. The strengths and versatility of the framework are validated on three publicly available dialogue task datasets.

(3) From the model aspect, to supply the fast learning ability that the dialogue policy model lacks, this thesis proposes a novel complementary policy learning (CPL) framework, which simulates the complementary learning mechanisms of the neocortex and hippocampus in the human brain. CPL exploits the complementary advantages of an episodic memory (EM) policy and a deep Q-network (DQN) policy, which guide each other during the periods when each is relatively stronger, to achieve fast and effective dialogue policy learning. Experimental results on three dialogue datasets show that our method significantly outperforms existing methods that rely on a single learning system.

(4) To address the high memory or design costs and the suboptimal guidance of existing guided dialogue policy learning methods (including CPL), this thesis proposes an automatic error detection and recovery (AEDR) algorithm, which detects when the dialogue policy has made a critical mistake and, at that moment, corrects the decision and recovers the conversation while preserving model
exploration. AEDR is independent of the size of the state and action spaces and requires no human intervention, which makes automatic decision guidance better targeted and more effective. Experiments demonstrate the effectiveness and generality of our method.