Reinforcement learning enables an agent to learn skills through interaction with its environment. The agent tries to learn how to map the states of the environment to actions so as to maximize the rewards it receives from the environment. Deep learning has given reinforcement learning a large performance boost, but it has also introduced new problems. This thesis consists of three parts addressing three key problems in reinforcement learning algorithms: (1) the negative effects caused by approximation error; (2) heavy dependence on large amounts of samples; (3) the instability of RL algorithms.

The first part of the thesis comprises Chapters 3 and 4. Chapter 3 presents a theoretical analysis of the convergence of actor-critic methods and concludes that a sufficient condition for convergence is difficult to satisfy when function approximation is used for the value function. This means that the approximation error in the value function not only causes overestimation but also harms the convergence of the algorithms. Chapter 4 proposes an effective method to mitigate the approximation error in the value function. It derives an upper bound on the approximation error of the Q-function approximator and concludes that the error can be reduced by keeping every two consecutive policies similar during policy training. Based on this conclusion, a new RL algorithm called error-controlled actor-critic (ECAC) is proposed. Ablation studies verify the correctness of this conclusion, and comparative evaluations demonstrate that ECAC significantly outperforms other model-free RL algorithms.

The second part of the thesis (Chapter 5) proposes a robust sample-guided training method to reduce the amount of sampling required. To increase the robustness of the samples, noise is injected into the expert's actions while demonstrations are being collected. In addition, in contrast to the pre-training approach, the sample-guided method uses the robust samples to guide the whole training process rather than merely to initialize the policy parameters. Experimental results show that the noise-injected samples are more efficient and that the sample-guided training method outperforms the pre-training method.

The third part of the thesis (Chapter 6) is concerned with making reinforcement learning algorithms more stable. Because RL algorithms use only one agent to explore the environment, it is hard to guarantee sample diversity, which determines the quality of the samples. Furthermore, RL algorithms are sensitive to hyper-parameters. Hybridizing RL with evolutionary algorithms (EAs) is a reliable way to address both issues. Chapter 6 proposes a framework called competitive swarm reinforcement learning (CSRL), a hybrid of RL and EA, to ensure the robustness of RL algorithms. The framework runs RL and EA in alternation. Agents in the same swarm share samples, and the differences in their exploration behaviors ensure sample diversity. During RL training, different policies are trained with different hyper-parameters so as to make the algorithm insensitive to hyper-parameter settings. Comparative evaluations demonstrate that CSRL significantly outperforms other similar frameworks.
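To make the idea behind the first part concrete, the following is a minimal sketch of how a consecutive-policy similarity constraint could be realized in an actor-critic update. It assumes PyTorch, a Gaussian policy, and a KL-divergence penalty; the network sizes, coefficient, and names are illustrative assumptions, not the thesis's actual ECAC implementation.

```python
# Minimal sketch: penalize divergence between the previous (frozen) policy and
# the updated one so that every two consecutive policies stay similar.
# All architectures and constants here are illustrative assumptions.
import copy
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return Normal(self.net(obs), self.log_std.exp())


class QCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))


def actor_loss(policy, old_policy, critic, obs, kl_coef=0.5):
    """Maximize the critic's value estimate while penalizing divergence from the
    previous (frozen) policy, which keeps consecutive policies similar."""
    dist = policy.dist(obs)
    actions = dist.rsample()                     # reparameterized sample, so gradients flow
    q_values = critic(obs, actions).squeeze(-1)  # critic's estimate of Q(s, a)
    with torch.no_grad():
        old_dist = old_policy.dist(obs)
    kl = kl_divergence(old_dist, dist).sum(-1)   # per-state KL between old and new policy
    return (-q_values + kl_coef * kl).mean()


# Illustrative usage: freeze a copy of the current policy before each update step.
obs_dim, act_dim = 8, 2
policy, critic = GaussianPolicy(obs_dim, act_dim), QCritic(obs_dim, act_dim)
old_policy = copy.deepcopy(policy).requires_grad_(False)
batch_obs = torch.randn(32, obs_dim)             # dummy batch of states
actor_loss(policy, old_policy, critic, batch_obs).backward()
```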
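For the second part, the sketch below illustrates one way noise could be injected into an expert's actions while demonstrations are collected. It assumes a Gymnasium-style environment API and zero-mean Gaussian noise; the function name and parameters are hypothetical and do not reproduce the thesis's exact procedure.

```python
# Minimal sketch: record expert demonstrations while perturbing each expert
# action with Gaussian noise, so the samples also cover states slightly off the
# expert's nominal trajectory. Names and defaults are illustrative assumptions.
import numpy as np


def collect_noisy_demonstrations(env, expert_policy, episodes=10, noise_std=0.1):
    """Roll out the expert, add zero-mean Gaussian noise to each of its actions,
    and store the visited transitions as the demonstration set."""
    demos = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = np.asarray(expert_policy(obs), dtype=np.float64)
            noisy_action = action + np.random.normal(0.0, noise_std, size=action.shape)
            next_obs, reward, terminated, truncated, _ = env.step(noisy_action)
            demos.append((obs, noisy_action, reward, next_obs))
            obs, done = next_obs, terminated or truncated
    return demos
```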
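For the third part, the following sketch shows the general shape of an alternating RL/EA loop over a swarm of policies that share one replay buffer and carry individual hyper-parameters. The policy methods (collect, rl_update, evaluate, mutate) and the truncation-style selection step are placeholders for illustration only; they are not the competitive-swarm update actually used in CSRL.

```python
# Minimal sketch of an alternating RL/EA loop; every identifier below is an
# illustrative assumption standing in for the framework's real components.
import random


def csrl_style_loop(swarm, hyper_params, shared_buffer, make_env, generations=100):
    """Alternate an RL phase (gradient updates on every policy, each with its own
    hyper-parameters, all feeding one shared buffer) with an EA phase (selection
    plus variation over the swarm)."""
    for _ in range(generations):
        # RL phase: each policy explores with its own settings; samples are shared.
        for policy, hp in zip(swarm, hyper_params):
            shared_buffer.extend(policy.collect(make_env(), steps=hp["explore_steps"]))
            policy.rl_update(shared_buffer.sample(hp["batch_size"]), lr=hp["lr"])

        # EA phase: evaluate every policy, keep the better half, refill by mutation.
        scores = [policy.evaluate(make_env()) for policy in swarm]
        ranked = [p for _, p in sorted(zip(scores, swarm), key=lambda pair: pair[0], reverse=True)]
        survivors = ranked[: len(ranked) // 2]
        offspring = [random.choice(survivors).mutate() for _ in range(len(swarm) - len(survivors))]
        swarm = survivors + offspring
    return swarm
```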