In the era of artificial intelligence, reinforcement learning is an important means of endowing agents with autonomous decision-making ability in open and changeable physical environments. However, because reinforcement learning requires extensive interaction between the agent and the environment during training to learn effective policies, it is difficult to apply in many real physical scenarios. Since simulators offer low sampling cost, we use a simulator to assist policy training. Yet the complex and changeable open environment, together with unavoidable simulation errors, makes policy reuse difficult: policies trained in the simulator often suffer severe performance degradation when executed in the target environment, so the task objectives cannot be accomplished. To address this, this paper proposes a real-time, execution-time policy evaluation method that determines whether the current state is suitable for the policy to execute and accomplish the task objectives; for states that are not suitable, a warning is issued and control is handed over to a human expert. Furthermore, this paper proposes an execution-time policy evolution method that enables policies trained in the simulator to self-evolve according to the target environment, so that they ultimately adapt to it and achieve the task objectives. After studying the core issues of policy evaluation and policy evolution in the open environment, the following results have been achieved:

1. Execution-time policy evaluation. To address the problem that simulation errors cause the performance of a reused policy to degrade sharply in the target environment, a policy confidence evaluation method based on generative adversarial learning (CEPO) is proposed. A confidence network runs in the target environment in real time alongside the policy and evaluates the confidence of the policy's performance in the current environmental state. When the confidence is greater than a threshold, the policy controls the current state; when it is less than the threshold, a warning is issued and a human expert takes over. The method is particularly suitable for complex scenarios with high safety and stability requirements (such as autonomous driving). Experiments show that the method assigns higher confidence values to states in which the policy performs well, and lower values to states in which the policy performs poorly or that have never been seen before, so that it can free human operators to a certain extent while ensuring safety and stability.
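The execution-time gating described above can be illustrated with a minimal sketch: a policy network and a separately (adversarially) trained confidence network run side by side, and a threshold on the confidence score decides whether the policy keeps control or control is handed to a human expert. The class names, network sizes, and threshold value below are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2
CONFIDENCE_THRESHOLD = 0.7  # hand control to a human expert below this value

class PolicyNet(nn.Module):
    """Policy trained in the simulator; maps a state to an action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACTION_DIM), nn.Tanh())
    def forward(self, state):
        return self.net(state)

class ConfidenceNet(nn.Module):
    """Discriminator-style network that scores how reliable the policy
    is expected to be in the given state (value in [0, 1])."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1), nn.Sigmoid())
    def forward(self, state):
        return self.net(state)

def act_or_handoff(state, policy, confidence_net):
    """Return (action, True) when the policy keeps control, or
    (None, False) to warn and hand control to a human expert."""
    with torch.no_grad():
        confidence = confidence_net(state).item()
        if confidence >= CONFIDENCE_THRESHOLD:
            return policy(state), True
        return None, False

policy, confidence_net = PolicyNet(), ConfidenceNet()
action, policy_in_control = act_or_handoff(torch.randn(STATE_DIM), policy, confidence_net)
print("policy in control:", policy_in_control)
```

In this sketch the gating decision is made per state at execution time, matching the abstract's description; how the confidence network is trained adversarially is omitted.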
2. Execution-time policy evolution. To address the case in which the policy receives a poor evaluation in the target environment, a policy evolution method based on action calibration (POSEC) is proposed. The agent executes only a few calibration actions in the target environment to sense it and extract environmental features that guide the evolution of the policy, so that the reused policy quickly adapts to the target environment and effective action control can be carried out directly to complete the task. Experiments verify that, across environments with a variety of parameter configurations, and compared with a policy learned from scratch in the target environment (which requires millions of samples), our method can quickly evolve an effective policy using only 5 samples from the target environment during policy reuse. Repeated experiments further confirm the stability and effectiveness of the calibration actions. The evolved policy can also serve as a better initial policy for retraining, continuing to improve performance in the target environment.
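One plausible reading of the calibration step is sketched below: a handful of fixed calibration actions are executed once in the target environment, the observed responses are stacked into a context vector characterizing the environment, and a context-conditioned policy then acts directly. The environment stub, the specific calibration actions, and the ContextPolicy class are illustrative assumptions rather than the thesis code.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2
CONTEXT_DIM = STATE_DIM * 5  # context from 5 calibration transitions

# Fixed, pre-chosen calibration actions executed once in the target environment.
CALIBRATION_ACTIONS = [torch.tensor([0.5, 0.0]), torch.tensor([-0.5, 0.0]),
                       torch.tensor([0.0, 0.5]), torch.tensor([0.0, -0.5]),
                       torch.tensor([0.5, 0.5])]

class ContextPolicy(nn.Module):
    """Policy conditioned on a context vector that encodes how the target
    environment responded to the calibration actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + CONTEXT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, ACTION_DIM), nn.Tanh())
    def forward(self, state, context):
        return self.net(torch.cat([state, context], dim=-1))

def calibrate(env_step, state):
    """Execute the calibration actions and stack the observed next states
    into a single context vector describing the environment dynamics."""
    responses = []
    for action in CALIBRATION_ACTIONS:
        state = env_step(state, action)  # one target-environment sample per action
        responses.append(state)
    return torch.cat(responses)

def dummy_env_step(state, action):
    """Random stub standing in for the real target-environment dynamics."""
    return state + 0.05 * action.sum() + 0.1 * torch.randn(STATE_DIM)

state = torch.zeros(STATE_DIM)
context = calibrate(dummy_env_step, state)  # only 5 interactions needed
policy = ContextPolicy()
action = policy(state, context)             # act with the adapted policy
print(action)
```

The point of the sketch is the interface: adaptation consumes only the few calibration transitions, and the resulting context steers the policy without retraining from scratch in the target environment.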