
State Estimation And Policy Learning In Partially Observable Markov Decision Processes

Posted on: 2023-04-25  Degree: Doctor  Type: Dissertation
Country: China  Candidate: J F Liu  Full Text: PDF
GTID: 1520307034982069  Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
In recent years, sequential decision problems in dynamic and uncertain environments have become a hot research topic in artificial intelligence and control. The partially observable Markov decision process (POMDP) provides a unified framework for describing such problems. Because the POMDP model accounts for the effect of state uncertainty on decision making, it describes the real world more faithfully and is widely used in scientific, industrial, commercial, military, and social domains. However, practical POMDP models are strongly nonlinear and stochastic, which raises several difficulties: observations are susceptible to time-varying noise and outliers, multi-agent cooperation scenarios incur high computational complexity, and an accurate analytical model of the system is often unavailable. These issues make POMDPs challenging to solve. To address them, this dissertation aims to improve the performance and efficiency of solving POMDPs, focusing on model-based state estimation and model-free policy learning. The main contributions are as follows:

(1) To address the low accuracy and poor convergence of state estimation when the measurements of a continuous-state POMDP with a known model contain time-varying noise and outliers, a robust state estimation approach based on an error-state fuzzy adaptive Kalman filter is proposed. In dynamic and uncertain environments, the statistical characteristics of the time-varying noise change frequently and substantially, so a fuzzy inference system is adopted to estimate the innovation contribution weight in the measurement noise covariance estimator, allowing the dynamic measurement noise characteristics to be captured quickly. To weaken the influence of outliers on filter performance, outliers are detected based on innovation orthogonality and corrected according to their degree of deviation. Simulation and physical experiment results show that the proposed approach is robust and adaptive, and effectively improves state estimation accuracy in environments whose measurements contain time-varying noise and outliers.
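To make the adaptation idea in contribution (1) concrete, the following Python fragment is a minimal, illustrative sketch rather than the thesis algorithm: it assumes a linear-Gaussian measurement model, replaces the fuzzy inference system with a fixed innovation contribution weight gamma, and substitutes a simple normalized-innovation threshold for the innovation-orthogonality outlier test; all function and variable names are hypothetical.

```python
import numpy as np

def adaptive_kf_update(x, P, z, H, R, gamma=0.3, outlier_thresh=3.0):
    """One Kalman measurement update with innovation-based noise adaptation.

    Simplified stand-in for the fuzzy adaptive scheme: the measurement noise
    covariance R is nudged toward an innovation-derived estimate, and
    measurement components with large normalized innovations are shrunk
    toward the prediction before being fused.
    """
    # Innovation (measurement residual) and its predicted covariance.
    y = z - H @ x
    S = H @ P @ H.T + R

    # Outlier correction: scale down components whose normalized innovation
    # exceeds the threshold (a crude proxy for the deviation-degree correction).
    d = np.abs(y) / np.sqrt(np.diag(S))
    y = y * np.minimum(1.0, outlier_thresh / np.maximum(d, 1e-12))

    # Innovation-based (diagonal) estimate of R; gamma plays the role of the
    # innovation contribution weight that the thesis estimates by fuzzy inference.
    R_innov = np.diag(np.maximum(np.diag(np.outer(y, y) - H @ P @ H.T), 1e-9))
    R_new = (1.0 - gamma) * R + gamma * R_innov

    # Standard Kalman gain and state/covariance update with the adapted R.
    S = H @ P @ H.T + R_new
    K = P @ H.T @ np.linalg.inv(S)
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new, R_new
```

In the dissertation the weight is produced online by a fuzzy inference system from the innovation statistics, which is what allows the filter to track rapidly changing measurement noise.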
(2) To balance the accuracy and efficiency of multi-agent cooperative position estimation in a continuous-state decentralized POMDP (Dec-POMDP) with a known model, two cooperative localization approaches based on task priors are proposed. With efficient use of task priors as the core idea, the work proceeds from two perspectives, modeling optimization and measurement processing. First, rigid distance and orientation constraints are introduced into the modeling process, yielding a two-agent cooperative localization approach that integrates the rigid-constraint model with the quadrature Kalman filter; the coupling between agents is exploited to reduce the dimension of the estimated state, which lowers the computational burden while preserving estimation accuracy. Second, anchor information is treated as a prior, and a node-switchable cooperative localization approach that mixes pseudo-anchor cooperation with non-anchor cooperation is presented. The concept of temporary pseudo-anchors (TPAs) is introduced, and heterogeneous cooperative measurements are integrated through a node-type switching mechanism so that measurements are used efficiently. To further extract useful information from redundant measurements, TPA selection strategies are designed according to information theory. Simulation results show that, with the assistance of task priors, a trade-off between the accuracy and efficiency of cooperative localization can be achieved.

(3) To handle the difficulty an agent has in inferring the environment state in a high-dimensional-observation POMDP with unknown model priors and incomplete observations, an action-dependent bidirectional contrastive predictive coding method for belief representation learning is proposed. Good belief representations provide a sound basis for decision making. In the proposed method, the observation encoders and the belief transition and prediction models are trained end-to-end through bidirectional (historical and future) prediction errors, so the upper bound of the prediction error is constrained by the bottleneck belief state, which improves the efficiency and accuracy of self-supervised belief representation learning. To stabilize training, the representation difference between the intersecting forward and reverse predictions is narrowed, and a bidirectional match regularization term is derived and adopted as one of the optimization objectives. In addition, a gradient truncation mechanism is used to explore the interpretability of the learned belief representations. Simulation results indicate that, beyond highly accurate belief tracking, state uncertainty is characterized reasonably, which supports solving for the optimal POMDP policy in downstream tasks.

(4) To combat poor policy learning performance caused by the unavailability of the environment state in a model-unknown POMDP with a high-dimensional observation space, a double deep Q-network reinforcement learning algorithm based on contrastive predictive coding representations is proposed. Standard deep reinforcement learning algorithms generally assume that observations contain the complete state information needed for decision making, an assumption that does not hold in POMDPs. The proposed algorithm therefore models belief states explicitly to obtain a compact and efficient history encoding for policy optimization. To improve data efficiency, a belief replay buffer is introduced, which reduces memory usage by storing belief transition pairs directly instead of observation and action sequences. In addition, a phased training strategy decouples representation learning from policy learning to improve training stability. Simulation results show that the proposed algorithm lets the agent break the “perceptual aliasing” dilemma and achieve stable and efficient policy learning in POMDPs. (A minimal illustrative sketch of the belief-conditioned double DQN update is given after the summary below.)

In summary, this dissertation focuses on state estimation and policy learning for complex POMDPs and proposes novel methods that take both performance and efficiency into account. These methods have important theoretical significance and practical application value for solving sequential decision problems in dynamic and uncertain environments.
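To illustrate the belief-conditioned policy learning summarized in contribution (4), here is a minimal, hypothetical sketch of a double DQN update operating on belief states, together with a simple belief replay buffer that stores belief transition tuples rather than observation-action sequences. It assumes belief vectors are produced elsewhere (for example, by a learned encoder as in contribution (3)); the contrastive-predictive-coding representation learning and the phased training strategy are omitted, and all class and function names are illustrative, not the thesis implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class BeliefDQN(nn.Module):
    """Q-network that takes a belief-state vector instead of raw observations."""
    def __init__(self, belief_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(belief_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, b):
        return self.net(b)

def ddqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One double DQN step on belief transitions (b, a, r, b_next, done)."""
    b, a, r, b_next, done = batch
    q = q_net(b).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Double DQN: online network selects the action, target network evaluates it.
        a_star = q_net(b_next).argmax(dim=1, keepdim=True)
        q_next = target_net(b_next).gather(1, a_star).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next
    loss = nn.functional.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Belief replay buffer: stores compact belief transition tuples, which is
# cheaper in memory than storing observation/action histories.
buffer = deque(maxlen=100_000)

def sample_batch(batch_size):
    b, a, r, bn, d = zip(*random.sample(buffer, batch_size))
    return (torch.stack(b), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
            torch.stack(bn), torch.tensor(d, dtype=torch.float32))
```

Because the belief vector is a compact summary of the history, an ordinary feed-forward Q-network suffices here; handling partial observability is delegated entirely to the belief representation.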
Keywords/Search Tags: Partially observable Markov decision process, Adaptive Kalman filter, Multi-agent cooperative localization, Belief representation learning, Deep reinforcement learning