
Research On The Exploration Performance Of Policy Based On Actor-Critic Framework

Posted on: 2021-05-17    Degree: Master    Type: Thesis
Country: China    Candidate: C Wei    Full Text: PDF
GTID: 2518306107959409    Subject: Mathematics
Abstract/Summary:
Reinforcement learning is a class of methods for solving sequential decision problems, and its combination with deep learning has driven further progress in the field. By interacting with the environment, an agent acquires knowledge about the system and selects actions according to what it has learned. Exploration is the behavior in which the agent abandons the currently optimal action and tries actions it has not selected before, in order to obtain higher long-term reward. Effective exploration remains a key challenge in reinforcement learning and plays an important role in learning an optimal policy. Frequently used approaches include optimistic exploration, optimism in the face of uncertainty, probability matching, and exploration based on information entropy. These algorithms were originally proposed for multi-armed bandits, which involve only three components: an action space, a reward function, and a policy. They were later extended to Markov decision processes.

The main purpose of this thesis is to develop an algorithm that yields a policy with good exploration performance and is suitable for environments with continuous action spaces and many degrees of freedom, such as robot control systems. I propose an algorithm consisting of a policy network, called the actor, and a policy evaluation network, called the critic; by training these two networks alternately, a convergent policy can be obtained. First, I construct a policy evaluation indicator. I then demonstrate the equivalence between minimizing the loss of Bayesian statistical inference and maximizing the posterior probability of the critic network's parameters. This result makes it possible to associate the policy evaluation indicator with a probability distribution, so that the probability matching technique can be extended to environments with continuous action spaces. To guarantee the flexibility of the critic network, I employ nested normalizing flows to estimate probability distributions. In addition, I train the critic network's parameters by minimizing an energy function, which helps the objective avoid getting stuck in a local optimum and further improves the agent's exploration ability.

Furthermore, the proposed algorithm is suitable not only for environments whose state space consists of poses but also for environments whose states are images. Experiments validate that the agent achieves high reward in complex systems with many degrees of freedom. Moreover, in the two simulation environments BipedalWalkerHardcore-v2 and MsPacmanNoFrameskip-v4, the proposed algorithm outperforms the state-of-the-art algorithm TRPO, which explores the system by adding stochastic perturbations to a deterministic action.
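To make the alternating actor-critic structure described above concrete, the following is a minimal, self-contained sketch in PyTorch. It is an illustration under simplifying assumptions only: the random toy transitions, the Gaussian actor, the plain MLP critic, and all hyperparameters are placeholders, and it does not reproduce the thesis's policy evaluation indicator, nested normalizing flow critic, or energy-function objective.

# Illustrative sketch of an alternating actor-critic update loop.
# All network shapes, the toy environment, and hyperparameters are assumptions
# made for illustration, not the method proposed in the thesis.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2

class Actor(nn.Module):
    """Gaussian policy over a continuous action space."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh())
        self.mu = nn.Linear(64, ACTION_DIM)
        self.log_std = nn.Parameter(torch.zeros(ACTION_DIM))

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

class Critic(nn.Module):
    """State-action value estimator (a plain MLP stands in for a flow-based critic)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def fake_transition(batch=64):
    """Stand-in for environment interaction; replace with a real simulator."""
    s = torch.randn(batch, STATE_DIM)
    with torch.no_grad():
        a = actor(s).sample()
    r = torch.randn(batch)                    # placeholder reward
    s_next = torch.randn(batch, STATE_DIM)
    return s, a, r, s_next

for step in range(1000):
    s, a, r, s_next = fake_transition()

    # Critic step: regress toward a one-step bootstrapped target.
    with torch.no_grad():
        a_next = actor(s_next).sample()
        target = r + gamma * critic(s_next, a_next)
    critic_loss = ((critic(s, a) - target) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor step: raise the value of sampled actions; an entropy bonus keeps
    # the policy stochastic, which is what sustains exploration.
    dist = actor(s)
    a_new = dist.rsample()
    actor_loss = (-critic(s, a_new) - 1e-2 * dist.entropy().sum(-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

In the thesis itself, the MLP critic above would be replaced by a nested-normalizing-flow density model trained by minimizing an energy function, and action selection would follow the probability matching scheme rather than a fixed entropy bonus.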
Keywords/Search Tags:reinforcement learning, Markov decision process, exploration performance of policy, continuous action space, probability matching, nested normalizing flows