
Research On Key Technologies Of Reinforcement Learning Algorithms For Continuous Action Space

Posted on: 2023-01-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M H Wu
Full Text: PDF
GTID: 1528306944456494
Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of artificial intelligence and the growing demand for automation, a new generation of control technology based on artificial intelligence algorithms will have a far-reaching impact on society. Neural networks can accurately model the relationship between inputs and outputs and have achieved great success in the field of pattern recognition. Among AI techniques, reinforcement learning (RL) has attracted extensive attention from robotics researchers because it learns through interaction with the environment and requires neither labeled data sets nor manual intervention during training. In practical applications, most robots operate in a continuous action space, controlling quantities such as direction, angle, and speed whose outputs lie in a continuous interval rather than among several discrete values. It is therefore of practical importance to study the key technologies of reinforcement learning in continuous action spaces. This dissertation takes reinforcement learning algorithms in continuous action spaces as its research object, uses obstacle avoidance and control tasks as the main test scenarios, and aims primarily at improving training efficiency. It studies four aspects in depth: action quality value estimation, multi-action cooperative control, exploration strategy design, and reward sparsity suppression. The main research contents are summarized as follows:

(1) Aiming at inaccurate estimation of action quality values in continuous action spaces, an Actor-Dueling-Critic (ADC) algorithm based on a dueling network is proposed. Inaccurate action value estimates make the training process unstable and lengthen convergence time, and increasing state uncertainty further degrades estimation accuracy. To improve stability and shorten convergence time, the concept of an action advantage interval is introduced and the actor-critic algorithm is optimized accordingly. During training, the action advantage interval value is decoupled from the environmental state, so that the agent's action interval is evaluated separately and the influence of the state value on the quality value is reduced. This better guides the agent to exploit the relative advantage between actions for efficient learning, and the training process converges more easily.

(2) Aiming at cooperative control of actions in multi-dimensional continuous spaces, a multi-dimensional action control (MDAC) algorithm based on hierarchical reinforcement learning is proposed. Most reinforcement learning algorithms are designed for single-dimensional actions; expanding the action dimensions an algorithm can handle enhances its task execution ability. Based on the feudal network, the algorithm sets up a manager and a worker: the manager is responsible for assigning tasks, and the worker performs the corresponding actions according to the environmental information and the manager's commands. Agents trained by this method take the correlation between actions into account.

(3) Aiming at the low exploration efficiency of reinforcement learning algorithms in continuous spaces, a dynamic exploration policy (DEP) based on the epsilon-greedy algorithm is proposed. Exploration and exploitation are a long-standing tension in reinforcement learning, and how to balance them requires deep study. If the exploration value in each state can be adjusted according to environmental conditions, the exploration strategy becomes more intelligent, especially in large environments with many features. This method enables the agent to understand the environment and learn a control strategy quickly.

(4) Aiming at reward sparsity in large-scale environments, a reward sparsity suppression technique based on an auxiliary network is proposed. When the environment is large and the target area is far away, sparse rewards prevent the agent from completing the task in a short time and thus from learning a control strategy effectively. The proposed auxiliary network method decomposes the main task into a combination of multiple sub-tasks and uses the auxiliary network to complete the sub-objectives. Under this method, the agent has a clearer guidance signal during exploration and can complete the main task faster.
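The dissertation does not reproduce the ADC algorithm's details here, but the dueling decomposition it builds on can be sketched briefly. In a dueling network, the action value is split into a state value V(s) and per-action advantages A(s, a); subtracting the mean advantage makes the split identifiable. The function name and numbers below are illustrative, not taken from the dissertation:

```python
import numpy as np

def dueling_q(state_value, advantages):
    """Combine a scalar state value V(s) with per-action advantages A(s, a).

    Subtracting the mean advantage makes the decomposition identifiable:
    Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')).
    """
    advantages = np.asarray(advantages, dtype=float)
    return state_value + (advantages - advantages.mean())

# The relative ordering of actions depends only on the advantages,
# so actions can be compared even when the state value is uncertain.
q = dueling_q(2.0, [0.5, -0.5, 0.0])  # -> [2.5, 1.5, 2.0]
```

Separating the advantage term in this way is what lets the agent learn from the relative quality of actions, which is the intuition behind the action advantage interval described above.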
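The dynamic exploration policy itself is not specified in this abstract, but the epsilon-greedy schedule it generalizes can be sketched as follows. This is a minimal, assumed sketch in which the exploration rate is annealed with the training step; the decay constants and function names are hypothetical:

```python
import math
import random

def dynamic_epsilon(step, eps_start=1.0, eps_end=0.05, decay=0.001):
    """Exponentially anneal the exploration rate from eps_start toward eps_end."""
    return eps_end + (eps_start - eps_end) * math.exp(-decay * step)

def select_action(q_values, step, rng=random):
    """Epsilon-greedy selection: explore with probability epsilon, else exploit."""
    if rng.random() < dynamic_epsilon(step):
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=q_values.__getitem__)    # exploit
```

A state-dependent variant, as the DEP method suggests, would make epsilon a function of environmental features rather than of the step count alone.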
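Finally, the idea of decomposing a distant goal into sub-objectives to densify sparse rewards can be illustrated with a toy one-dimensional sketch. Everything here (the bonus size, tolerance, and waypoint representation) is an illustrative assumption, not the dissertation's auxiliary-network design:

```python
def shaped_reward(env_reward, agent_pos, subgoals, bonus=0.1, tol=0.5):
    """Add a small bonus when the agent reaches the next pending subgoal.

    Decomposing a distant goal into waypoints densifies the reward signal,
    giving the agent guidance long before the final target is reached.
    Returns the shaped reward and the remaining subgoal list.
    """
    if subgoals and abs(agent_pos - subgoals[0]) <= tol:
        return env_reward + bonus, subgoals[1:]   # subgoal completed
    return env_reward, subgoals
```

For example, an agent at position 1.0 with waypoints [1.2, 3.0] collects the bonus for the first waypoint and is left with [3.0], so it receives a learning signal well before reaching the final target.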
Keywords/Search Tags: Reinforcement learning, Continuous action space, Neural network, Sparse reward