
Application Of Approximate Dynamic Programming And Reinforcement Learning To Lost-Sales Models And Perishable Product Models

Posted on: 2021-06-20    Degree: Master    Type: Thesis
Country: China    Candidate: W J Zhan    Full Text: PDF
GTID: 2510306302976139    Subject: Financial Information Engineering
Abstract/Summary:
The lost-sales model and the perishable-product model are two classic but difficult problems in inventory management. In the lost-sales model, goods ordered by the seller take a lead time to arrive. If demand arrives during this period and the seller's on-hand inventory cannot fully satisfy it, the unmet demand is lost, causing large losses to the seller. The perishable-product model concerns perishable goods such as fresh fruit, vegetables, and seafood, which have short shelf lives and must be disposed of at expiry, at considerable cost. This model likewise assumes a positive lead time between ordering and arrival, but unmet demand is not lost; instead it may be backlogged and filled later, again at a cost.

Both models require a state vector describing the inventory position, whose dimension depends on the order lead time and the shelf life of the goods. The longer the lead time and shelf life, the higher the dimension of the state vector and the larger the state space. This huge state space brings the "curse of dimensionality" to optimizing the model and computing the optimal ordering policy.

This thesis approximates the cost-to-go function under the optimal policy with a quadratic function that has L-convex structure, which transforms the original dynamic programming problem into a linear programming problem. Constructing the linear program requires a large number of sample paths to form its constraints: the closer these paths are to those generated under the optimal policy, and the more of them there are, the closer the fitted quadratic approximation is to the true cost-to-go function under the optimal policy. Accordingly, for the lost-sales model and the perishable-product model, this thesis uses the SVBS policy and the Myopic policy respectively, two heuristics that are easy to implement and perform well, to generate sample paths, keeping the number of paths between 5,000 and 10,000 to balance solution time against solution quality. Solving the resulting linear program yields an initial quadratic approximation of the cost-to-go function under the optimal policy.

Next, a linear combination of the Bellman error of the MDP and the policy loss proposed in this thesis is used as the training objective. The Bellman error measures the inconsistency of the cost-to-go function between any two consecutive periods under the current ordering policy; minimizing it makes the cost-to-go approximation as accurate as possible. The policy loss is the cost incurred by the current ordering decision, plus the cost-to-go of the next state, minus the cost-to-go of the current state; minimizing it improves the ordering policy and reduces the cost of ordering decisions.

Demand sequences are then generated for the lost-sales model and the perishable-product model from a Poisson distribution with mean 5 and a truncated normal distribution with mean 10, respectively. Every 200 or 1,000 periods, the sample data collected over those periods are used to update the parameters of the quadratic approximation of the cost-to-go function, with each update taken in the direction of the negative gradient of the objective.
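The sketch below illustrates, in Python, a negative-gradient update of the kind just described; it is not the thesis's code. The quadratic parameterization (Q, q, c), the learning rate, and the discount factor are illustrative assumptions, and only the Bellman-error term of the combined objective is shown.

```python
# Minimal sketch (assumed names and hyperparameters, not the thesis's code)
# of a semi-gradient step on the squared Bellman error for a quadratic
# cost-to-go approximation V(x) = x'Qx + q'x + c.
import numpy as np

dim = 4                    # state dimension grows with lead time / shelf life
Q = 0.1 * np.eye(dim)      # quadratic coefficients, e.g. from the LP solution
q = np.zeros(dim)          # linear coefficients
c = 0.0                    # constant term

def v_hat(x):
    """Quadratic approximation of the cost-to-go function."""
    return x @ Q @ x + q @ x + c

def bellman_step(x, cost, x_next, gamma=0.99, lr=1e-4):
    """One negative-gradient update on (cost + gamma*V(x') - V(x))^2,
    holding V(x') fixed as in semi-gradient temporal-difference methods.
    The thesis's full objective also adds a weighted policy-loss term."""
    global Q, q, c
    delta = cost + gamma * v_hat(x_next) - v_hat(x)  # Bellman residual
    # gradient of V(x) with respect to (Q, q, c) is (x x', x, 1)
    Q += lr * 2.0 * delta * np.outer(x, x)
    q += lr * 2.0 * delta * x
    c += lr * 2.0 * delta
```

In the procedure described above, such updates would be applied in batches, using the transitions collected over each window of 200 or 1,000 periods.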
For the same demand sequence, the costs incurred by the ordering policies derived before and after each parameter update are compared. Finally, the updated ordering policy and the policy derived from the cost-to-go function obtained by solving the linear program are simulated under the same 12,000-period demand sequence, and their average costs over the final 10,000 periods are compared.
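As an illustration of this evaluation protocol, the sketch below simulates a simple base-stock policy on a single 12,000-period Poisson demand sequence (mean 5, matching the lost-sales experiments) and averages the cost over the final 10,000 periods. The base-stock policy, lead time, and cost coefficients are assumptions for illustration; the thesis evaluates its ADP-derived policies instead.

```python
# Minimal sketch of the evaluation protocol for the lost-sales model.
# Lead time, cost coefficients, and the base-stock policy are assumed.
import numpy as np

rng = np.random.default_rng(0)

LEAD_TIME = 2
HOLD_COST, LOST_COST = 1.0, 9.0   # per-unit holding / lost-sales cost

def simulate(policy, demand):
    """Run a lost-sales inventory simulation; return per-period costs."""
    on_hand = 0.0
    pipeline = [0.0] * LEAD_TIME          # orders placed, not yet arrived
    costs = []
    for d in demand:
        on_hand += pipeline.pop(0)        # oldest outstanding order arrives
        pipeline.append(policy(on_hand, pipeline))
        sold = min(on_hand, d)
        lost = d - sold                   # unmet demand is lost
        on_hand -= sold
        costs.append(HOLD_COST * on_hand + LOST_COST * lost)
    return np.array(costs)

def base_stock(level):
    """Order up to `level`, counting on-hand plus pipeline inventory."""
    return lambda on_hand, pipeline: max(0.0, level - on_hand - sum(pipeline))

demand = rng.poisson(5, size=12_000).astype(float)  # mean-5 Poisson demand
costs = simulate(base_stock(12.0), demand)
print("average cost over final 10,000 periods:", costs[-10_000:].mean())
```

Discarding the first 2,000 periods before averaging, as the thesis does, removes the influence of the arbitrary initial inventory state on the comparison.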
Keywords/Search Tags: Approximate Dynamic Programming, Reinforcement Learning, Lost-Sales model, Perishable product model