
Research On Environment Adaptive Reinforcement Learning Methods

Posted on: 2022-08-05 | Degree: Master | Type: Thesis
Country: China | Candidate: Y X Wang | Full Text: PDF
GTID: 2518306323462454 | Subject: Computer application technology
Abstract/Summary:
Although reinforcement learning has been successfully applied in many areas, its applications are still limited by the sparse reward problem and the environment non-stationarity problem. The performance of reinforcement learning strongly depends on how well the reward signal frames the goal of the application's designer and how well the model addresses environment non-stationarity. These two issues essentially reflect the accuracy of environment modeling and the stability of the optimization process. Reward function adaptation and environment dynamics adaptation are critical parts of applying reinforcement learning to non-standard environments, and they require that the algorithm can automatically design the reward function and train adaptively in complex environments. From the perspectives of environment modeling and of the solving process, we conduct research on environment adaptation methods for reinforcement learning through reward function adaptation and environment dynamics adaptation.

From the perspective of environment modeling, to address the challenge of reward design we propose the Motivation-Based Reward Design (MBRD) method. MBRD introduces the concept of motivation, which captures the underlying goal of maximizing certain rewards. The basic idea of MBRD is to automatically generate goal-consistent intrinsic rewards for the agent by minimizing the distance between the intrinsic and extrinsic motivations (a toy sketch of this idea follows the abstract). The core of MBRD is to solve two problems: how to map a reward function to a motivation, and how to measure the distance between motivations. MBRD provides the ability to improve the reward function based on training dynamics. We conduct extensive experiments in three Grid-world environments and two MuJoCo environments, and show the advantages of MBRD in handling the problems of delayed reward, exploration, and credit assignment.

From the perspective of the solving process, to address the challenge of environment non-stationarity we propose the Policy Adaptive Multi-Agent Deep Deterministic Policy Gradient (PAMADDPG) method. We model environment non-stationarity with a finite set of scenarios and train a policy fitting each scenario. In addition to the multiple policies, each agent also learns a policy predictor that determines, from its local information, which policy is best to execute (see the second sketch below). The core of PAMADDPG is to solve two problems: how to train multi-policy agents, and how to choose the execution policy. PAMADDPG provides the ability to train stably under non-stationary environment dynamics. We empirically evaluate our method on three Multi-Agent Particle Environment scenarios and show that PAMADDPG performs better than the baseline methods on mixed cooperative-competitive domains and a fully cooperative domain.
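The following is a minimal sketch of the MBRD idea, not the thesis's actual implementation. It assumes, for illustration only, that a "motivation" is represented as the discounted return a reward function induces along a sampled trajectory, that the distance between motivations is a squared error, and that the intrinsic reward is a small learned network r_phi(s, a); the names IntrinsicReward, discounted_motivation, and mbrd_update are hypothetical.

```python
# Illustrative MBRD sketch (assumptions above; not the thesis's code).
import torch
import torch.nn as nn

class IntrinsicReward(nn.Module):
    """Learnable intrinsic reward r_phi(s, a) (hypothetical parameterization)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def discounted_motivation(rewards, gamma: float = 0.99):
    """Map a (T,) reward sequence to a scalar 'motivation': its discounted return.
    This is one assumed choice for the reward-to-motivation mapping."""
    discounts = gamma ** torch.arange(rewards.shape[0], dtype=rewards.dtype)
    return (discounts * rewards).sum()

def mbrd_update(reward_model, optimizer, obs, act, extrinsic_r, gamma=0.99):
    """One gradient step pulling the intrinsic motivation toward the extrinsic one."""
    intrinsic_r = reward_model(obs, act)                 # (T,) intrinsic rewards
    m_int = discounted_motivation(intrinsic_r, gamma)
    m_ext = discounted_motivation(extrinsic_r, gamma)
    loss = (m_int - m_ext) ** 2                          # assumed motivation distance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage on a random 50-step trajectory with a sparse terminal reward.
torch.manual_seed(0)
model = IntrinsicReward(obs_dim=4, act_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
obs, act = torch.randn(50, 4), torch.randn(50, 2)
ext = torch.zeros(50); ext[-1] = 1.0                     # reward only at the end
for _ in range(100):
    mbrd_update(model, opt, obs, act, ext)
```

Because the intrinsic reward is dense while the extrinsic reward is sparse, training it to induce the same motivation is one way an agent could receive goal-consistent per-step feedback, which matches the delayed-reward and credit-assignment advantages the abstract reports.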
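The second sketch illustrates only the execution-time policy selection in PAMADDPG, again as an interpretation rather than the thesis's implementation. It assumes each agent holds K deterministic policies, one trained per scenario, and that the policy predictor is a small classifier over the agent's local observation; how the K policies themselves are trained (MADDPG-style, with centralized critics) is omitted, and PolicyPredictor and act are hypothetical names.

```python
# Illustrative PAMADDPG policy-selection sketch (assumptions above).
import torch
import torch.nn as nn

class PolicyPredictor(nn.Module):
    """Scores the K per-scenario policies from the agent's local observation."""
    def __init__(self, obs_dim: int, n_policies: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_policies),
        )

    def forward(self, obs):
        return self.net(obs)  # unnormalized score per policy

def act(obs, policies, predictor):
    """Pick the best-scoring policy for this local observation and act with it."""
    with torch.no_grad():
        k = predictor(obs).argmax(dim=-1).item()
    return policies[k](obs), k

# Toy usage: 3 scenario policies, each a small deterministic actor.
obs_dim, act_dim, K = 8, 2, 3
policies = [nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                          nn.Linear(32, act_dim), nn.Tanh()) for _ in range(K)]
predictor = PolicyPredictor(obs_dim, K)
obs = torch.randn(obs_dim)
action, chosen = act(obs, policies, predictor)
```

The design point this illustrates is that selection uses only local information, so each agent can adapt at execution time without observing which scenario, and hence which non-stationary dynamics, it is actually facing.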
Keywords/Search Tags: Reinforcement Learning, Reward Design, Multi-Agent System