The routing of logistics vehicles is closely related to the efficiency of logistics activities and the cost of transportation, and is an important problem that the logistics industry must solve during industrial transformation and upgrading. The rapid development and wide application of information and communication technology, together with new logistics forms such as fresh-product logistics in e-commerce, have driven the development and application of real-time dynamic route planning methods. Starting from this new situation in current logistics activities, this article aims to overcome the curse of dimensionality in stochastic vehicle routing problems. Using the theory of approximate dynamic programming and function approximation techniques, we model the fundamental multi-vehicle routing problem with stochastic demands and duration limits, which underlies many operational challenges in logistics, and then propose two corresponding algorithms. The main contents are as follows.

First, we model the multi-vehicle routing problem with stochastic demands and duration limits as a large-scale Markov decision process. The objective of the model is to maximize the total expected demand of the served customers, and the state variable, which stores the information needed for decisions, consists of the vehicles' states and the customers' states.

Second, on the basis of this model, we develop two online learning algorithms: approximate policy iteration based on linear function approximation (RLSTD-API) and approximate value iteration based on basis function optimization (CEO-AVI). RLSTD-API uses the k-means clustering algorithm to capture important features of the state variable, and follows the approximate policy iteration framework of approximate dynamic programming, with recursive least-squares temporal difference learning serving as the policy evaluation step.
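The two ingredients of RLSTD-API named above, k-means features and recursive least-squares temporal difference (RLS-TD) evaluation, can be sketched as follows. The abstract does not specify how clusters become features, so the Gaussian-RBF-on-centroids construction, the bias feature, and all parameter names here are illustrative assumptions, not the thesis's actual design.

```python
import numpy as np

def kmeans_centroids(points, k, iters=20, seed=0):
    """Plain k-means: pick k centroids summarizing sampled state vectors."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid, then recompute means
        labels = np.argmin(
            ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2),
            axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

def rbf_features(state, centroids, width=1.0):
    """Assumed feature map: Gaussian RBFs centered at the k-means centroids,
    plus a constant bias feature."""
    sq_dists = ((centroids - state) ** 2).sum(axis=1)
    return np.append(np.exp(-sq_dists / (2.0 * width ** 2)), 1.0)

class RLSTD:
    """RLS-TD(0) policy evaluation: V(s) ~ theta . phi(s), with the inverse
    of the LSTD A-matrix maintained recursively (Sherman-Morrison)."""
    def __init__(self, dim, delta=1.0, gamma=0.95):
        self.theta = np.zeros(dim)
        self.P = np.eye(dim) / delta  # running estimate of A^{-1}
        self.gamma = gamma

    def update(self, phi, reward, phi_next):
        d = phi - self.gamma * phi_next          # TD feature difference
        Pphi = self.P @ phi
        K = Pphi / (1.0 + d @ Pphi)              # gain vector
        self.theta = self.theta + K * (reward - d @ self.theta)
        self.P = self.P - np.outer(K, d @ self.P)

    def value(self, phi):
        return float(self.theta @ phi)
```

With tabular features the recursion recovers the exact value: for a single absorbing-free state with reward 1 and gamma 0.5, repeated updates drive the estimate to 1/(1 - 0.5) = 2.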
CEO-AVI is motivated by the difficulty of manually designing basis functions that achieve good approximation quality without prior knowledge. By combining an off-policy learning algorithm from reinforcement learning with the cross-entropy optimization method, we obtain an approximate value iteration algorithm based on basis function optimization.

Finally, we demonstrate the validity of the two algorithms by testing them on benchmarks and comparing them with other algorithms, and we analyze their respective scopes of application.
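The cross-entropy optimization step at the heart of CEO-AVI can be sketched in isolation. In the thesis the objective would be the approximation error of the value function under candidate basis parameters; here we substitute a toy stand-in (fitting one RBF basis to a hypothetical target value function), so the objective, the parameterization, and all names below are assumptions for illustration only.

```python
import numpy as np

def cross_entropy_minimize(objective, dim, iters=30, pop=50,
                           elite_frac=0.2, seed=0):
    """Cross-entropy method: sample candidates from a Gaussian, keep the
    elite (lowest-objective) fraction, refit the Gaussian, repeat."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))
        scores = np.array([objective(x) for x in samples])
        elite = samples[np.argsort(scores)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Toy stand-in for the basis-optimization objective: squared error of a
# single-RBF fit to a hypothetical target value function on sampled states.
states = np.linspace(-2.0, 2.0, 21)
target_v = np.exp(-states ** 2)  # assumed "true" values, for illustration

def fit_error(params):
    center, log_width = params
    phi = np.exp(-(states - center) ** 2 / (2.0 * np.exp(log_width) ** 2))
    # best linear weight for this candidate basis, then the residual
    w = (phi @ target_v) / (phi @ phi + 1e-9)
    return float(((w * phi - target_v) ** 2).sum())

best = cross_entropy_minimize(fit_error, dim=2)
```

The point of the sketch is the division of labor: the inner linear fit is cheap and closed-form, while the cross-entropy loop searches the nonlinear basis parameters (center and width) that manual design would otherwise have to guess.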