| Generative models have a wide range of applications and can model many kinds of data, such as images, text, and sound. Deep generative models exploit the ability of deep neural networks to approximate arbitrary functions in order to model complex distributions. As an important branch of machine learning, reinforcement learning is increasingly used in robotic systems, machine translation, recommendation systems, and other areas. Traditional reinforcement learning methods learn from interaction with the environment. However, most sequential decision problems cannot supply the immediate reward signal that these algorithms need, and this has become the bottleneck in applying reinforcement learning to more complex decision problems. Inverse reinforcement learning addresses this by designing an optimization model that recovers the reward function of a Markov decision process. Imitation learning algorithms that combine inverse reinforcement learning with direct reinforcement learning have made great progress, with applications including the Stanford autonomous helicopter, HVAC control, and self-driving cars. Because imitation learning can derive decisions from expert demonstrations, it is becoming increasingly popular. In this paper, we first propose a new generative adversarial network, SSGAN. It links mode collapse to a failure of the initial generated manifold. SSGAN adds a supervisory signal to GANs to keep the generated manifold close to the real data manifold during initialization. Experimental results show that this method outperforms other recent GAN training methods in both visual quality and mode coverage. This paper also proposes a new imitation learning algorithm with unknown-area perception, which learns only the distribution of transitions in the expert data rather than comparing directly against the expert data, so that the actor better captures expert behavior from expert demonstrations. We extend GAIL to UAIL by adding an auto-encoder to 
predict the similarity between sampled states and expert states. This yields a new criterion for training the discriminator. In this way, the policy can better capture expert behavior and is less affected by the environment. Experiments show that the new method surpasses GAIL in several Atari 2600 games and multiple MuJoCo environments. This paper also introduces an application of imitation learning on the JD e-commerce platform. To overcome the high real-world cost of training RL for commodity search on jd.com, we use GAN-SD and MAIL to build a virtual JD simulator from historical data. Empirical results show that the simulator faithfully reflects the characteristics of the real environment. We then use the proposed method in the virtual JD environment to train an improved search engine policy, and the results show that this policy performs better than the traditional supervised learning method. |