Task-oriented dialog systems are widely used in daily life to complete specific tasks. The dialog policy module guides the direction of the conversation and largely determines both the user experience and the success of the dialog, making it a key component of such systems. Existing research either trains the dialog policy module separately or trains it synchronously with the other modules of the dialog system. The former ignores the influence of the other modules on the dialog policy, so the policy lacks tolerance for their errors. The latter is unstable because the modules affect one another during training. To address these issues, this paper studies the dialog policy module using multi-agent reinforcement learning (MARL). The specific work is as follows.

An asynchronous reinforcement learning framework is proposed for training task-oriented dialog systems. Asynchronous updating means that the dialog policy module and the other modules are updated at different frequencies during joint training. The framework has three characteristics. First, the dialog policy module is modeled and trained as part of the entire dialog system, and the dialog state tracking module and the dialog policy module are updated asynchronously, which alleviates the mutual influence between modules. Second, curriculum learning is introduced to adjust the training samples and training process of the dialog state tracking module: based on the accuracy of the dialog state tracking module under different user actions, the user actions are divided into three levels from easy to difficult, and targeted training is conducted on the samples that the model finds hard to learn. Third, a user model is constructed to assist the training of the dialog system, which increases the diversity of users and of the dialog data collected by the system and makes training more thorough. In addition, the existing reward design is improved, and both the user model and the dialog system model are trained with reinforcement learning, which improves the accuracy of the user model while the dialog system is being trained.

This paper further constructs a dataset for collecting phone numbers. The dataset has the characteristics of a sub-slot-based task-oriented dialog task and is more realistic than existing dialog data, with more complex dialog actions, making it the most action-rich single-domain task-oriented dialog dataset. Experimental results on this dataset show that the proposed method outperforms existing typical reinforcement learning methods.

Finally, a dialog system application is designed and implemented based on the dialog policy model and training method proposed in this paper. The system has been partially deployed in an actual customer service system for testing, and the results show that it can effectively complete the task of collecting complex user phone numbers through multi-turn dialog.
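
The two mechanisms described above, updating modules at different frequencies and grouping user actions into three difficulty levels by dialog state tracking accuracy, can be illustrated with a minimal sketch. This is not the paper's implementation. The function names, the accuracy thresholds (0.9 and 0.6), the update intervals, the choice of which module is updated more often, and the action labels in the usage lines are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def bucket_actions_by_accuracy(records, easy_min=0.9, hard_max=0.6):
    """Group user actions into three difficulty levels.

    records: iterable of (user_action, tracker_was_correct) pairs.
    easy_min / hard_max: hypothetical accuracy thresholds, not from the paper.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for action, was_correct in records:
        total[action] += 1
        correct[action] += int(was_correct)

    levels = {"easy": [], "medium": [], "hard": []}
    for action in sorted(total):
        accuracy = correct[action] / total[action]
        if accuracy >= easy_min:
            levels["easy"].append(action)
        elif accuracy < hard_max:
            levels["hard"].append(action)
        else:
            levels["medium"].append(action)
    return levels


def modules_to_update(step, policy_every=1, dst_every=5):
    """Return the modules updated at a given training step.

    The intervals are assumptions: here the policy is updated every step
    and the dialog state tracker (dst) only every fifth step.
    """
    updated = []
    if step % policy_every == 0:
        updated.append("policy")
    if step % dst_every == 0:
        updated.append("dst")
    return updated


# Hypothetical evaluation log: (user action, whether the tracker was correct).
log = (
    [("inform", True)] * 10
    + [("repeat", True)] * 7 + [("repeat", False)] * 3
    + [("correct", True)] * 3 + [("correct", False)] * 7
)
levels = bucket_actions_by_accuracy(log)
schedule = [modules_to_update(step) for step in range(6)]
```

In this sketch, the actions placed in the `hard` level would be the ones sampled more often when training the dialog state tracker, while the schedule keeps the tracker fixed for several policy updates so that the two modules do not change at the same time.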