| Large manufacturing enterprises whose products and customers are distributed nationwide, or even worldwide, mostly adopt a multi-echelon structure for their spare parts inventory in order to reduce inventory costs and improve response speed. However, because the state space and action space of the multi-echelon spare parts inventory optimization problem grow exponentially with the number of warehouses, the policy optimization model is difficult to establish and solve. Motivated by the actual needs of a wind power manufacturing enterprise, this thesis applies the Action Branching and Wolpertinger architectures to deep reinforcement learning algorithms to address the state and action space explosion in multi-echelon spare parts inventory optimization. The policy structure obtained from the deep reinforcement learning algorithm is then systematically studied and explained. First, a multi-echelon spare parts inventory system with one central warehouse and multiple local warehouses is investigated. The replenishment policy of each warehouse is optimized with the average cost per unit time as the objective function, taking emergency transportation and the time window into account. To this end, a Markov decision process model is constructed, action branching and a value-based deep reinforcement learning algorithm are used to solve for the optimal policy, and reward shaping based on a prior policy is used to improve the algorithm's performance. Practical cases from a wind power company show that the replenishment policy obtained in this study is more cost-effective than the parameterized inventory policy obtained by a genetic algorithm. Second, building on this work, a policy gradient deep reinforcement learning algorithm is used to optimize a multi-echelon spare parts replenishment policy for a system with multiple central warehouses, in which lateral transshipment between central warehouses is considered. To solve the action space explosion problem, the 
replenishment decision variables are treated as continuous variables, and the Wolpertinger architecture is used to select feasible actions in the continuous action space. In the actual case of a wind power company, the proposed algorithm obtains higher-quality solutions in less time than a genetic algorithm. Finally, this thesis explains the inventory policy obtained by the deep reinforcement learning algorithm and discusses its effectiveness. Because the "black box" nature of deep reinforcement learning makes the replenishment policy difficult to understand and explain, and in order to improve the credibility and persuasiveness of the obtained policy, this thesis adopts Local Interpretable Model-Agnostic Explanations (LIME) to fit a model to the replenishment policy under the classical inventory position. For different cases, this identifies the system state variables that affect the optimal replenishment amount and the strength of their influence on the replenishment decision. The results show that the inventory policy obtained from deep reinforcement learning is reasonably interpretable and provides a reference for proposing a multi-echelon inventory heuristic policy structure. In summary, this thesis solves the action space explosion problem that arises when deep reinforcement learning is applied to multi-echelon inventory decision making, improves the performance of deep reinforcement learning using a prior policy, and systematically explains and analyzes the policy obtained from deep reinforcement learning. Practical cases validate the effectiveness of the proposed method. |