
Research On Reinforcement Learning Recommendation Method For Virtual Taobao

Posted on: 2022-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: X F Zhang
Full Text: PDF
GTID: 2518306758491944
Subject: Trade Economy
Abstract/Summary:
Internet shopping has spread to every corner of the world, and recommending product lists to users through multi-dimensional recommendation algorithms deployed on online shopping platforms is of great significance for promoting product sales and meeting users' shopping needs. However, neither traditional recommendation algorithms nor the deep-learning-based methods of recent years can model long-term rewards. In the e-commerce recommendation scenario, the recommendation engine and the system's users constitute each other's environment, and their interaction is a sequence of time-dependent decisions that closely matches the Markov decision process. Using reinforcement learning to maximize the platform's long-term revenue is therefore an important idea for a breakthrough in e-commerce recommendation, and in recent years major e-commerce businesses have begun to apply reinforcement learning to product-list recommendation.

The principle of reinforcement learning is to let the agent learn and explore autonomously, like a human: through self-directed trial and the reward signals fed back by the environment, it continuously optimizes its action policy so that it can select high-reward actions for the different environmental states it observes. An e-commerce trading platform involves the interests of merchants and hundreds of millions of consumers, so the cost of developing and training models in the real online environment is enormous and unbearable. A simulation environment that fits the real engineering environment is therefore the first step toward deploying reinforcement learning in e-commerce. Domestic e-commerce platforms such as Taobao, Jing Dong, and Meituan have invested substantial human and financial resources in reinforcement learning recommendation. At present, the only open-source e-commerce simulation environment available is "Virtual Taobao", jointly developed by Alibaba Group and Nanjing University.

This paper uses "Virtual Taobao" to train three reinforcement learning algorithms with superior current performance, PPO, SAC, and TD3, and achieves better learning results than the DDPG used by the original authors. Combining each method's principle with its experimental behavior, this paper adjusts PPO, SAC, and TD3 to varying degrees to further improve their learning results.

PPO's sampling efficiency is very poor, giving it the lowest learning result among all the methods. This paper uses four sub-threads to interact with the environment and collect training trajectories for the main thread to learn from. In addition, strict comparison experiments were conducted on important parameters such as the number of times each sample is reused.

SAC's policy network may generate many boundary actions, which increases calculation errors during training. This paper uses a new formula to calculate the policy entropy and reduce these errors, which improves both the learning result and the stability of SAC in "Virtual Taobao".

TD3 is very sensitive to the environment, so this paper designs three important adjustments to it. First, the target policy network and the current policy network each predict a candidate action for the current state of the virtual environment, and the candidate with the larger value as evaluated by the current critic network is selected, aiming to improve the learning efficiency of the policy network. Second, the Ornstein-Uhlenbeck (OU) process is used as the exploration noise to improve the agent's ability to explore Virtual Taobao. Third, prioritized experience replay is adopted to improve sampling efficiency. With these adjustments applied, the learning result of TD3 in "Virtual Taobao" is greatly improved.

For convenience of description and comparison, the PPO, SAC, and TD3 variants trained and adjusted in "Virtual Taobao" are denoted DPPO (Distributed PPO), SAC-E (SAC with a new Entropy formula), and TTD3 (TD3 with three adjustments), respectively. CTR (Click-Through Rate) is used to measure the performance of all methods. The learning results of the adjusted PPO, SAC, and TD3 are significantly improved: because of PPO's poor baseline performance, the average CTR of DPPO is much higher than that of PPO; the average CTR of SAC-E is about 15 percentage points higher than that of SAC; and the average CTR of TTD3 is about 9 percentage points higher than that of TD3. This paper thus provides a reference model for the online tuning of reinforcement learning in e-commerce recommendation scenarios.
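The distributed-sampling idea behind DPPO can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: all names are hypothetical, the "environment" and "policy" are random placeholders, and four worker threads simply push rollouts onto a queue that the main (learner) thread drains.

```python
import queue
import threading

import numpy as np


def rollout(seed, horizon=16):
    """Stand-in for one on-policy rollout: a list of (state, action, reward)."""
    rng = np.random.default_rng(seed)
    state = rng.normal(size=4)
    traj = []
    for _ in range(horizon):
        action = rng.normal(size=2)       # placeholder policy
        reward = float(rng.random())      # placeholder environment reward
        traj.append((state, action, reward))
        state = state + 0.1 * rng.normal(size=4)  # placeholder dynamics
    return traj


def worker(worker_id, out_q, n_rollouts=3):
    """Each sub-thread collects trajectories in its own environment copy."""
    for i in range(n_rollouts):
        out_q.put(rollout(worker_id * 100 + i))


q = queue.Queue()
threads = [threading.Thread(target=worker, args=(w, q)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The main thread gathers all transitions into one training batch.
batch = []
while not q.empty():
    batch.extend(q.get())
```

With 4 workers, 3 rollouts each, and a horizon of 16, the learner receives 192 transitions per collection round; in real DPPO the main thread would then run the PPO update on this batch.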
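The boundary-action problem addressed by SAC-E arises because SAC squashes Gaussian pre-actions through tanh, and the log-density correction log(1 - tanh(u)^2) underflows for near-boundary actions. The thesis's exact entropy formula is not given in the abstract; the sketch below shows one standard numerically stable identity for this term, as an illustration of the kind of fix involved.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def naive_log_det_jacobian(u, eps=1e-6):
    """Direct formula log(1 - tanh(u)^2): saturates to log(eps) for large |u|."""
    return np.log(1.0 - np.tanh(u) ** 2 + eps)


def stable_log_det_jacobian(u):
    """Identity: log(1 - tanh(u)^2) = 2*(log 2 - u - softplus(-2u))."""
    return 2.0 * (np.log(2.0) - u - softplus(-2.0 * u))


# u = 20.0 is a near-boundary pre-squash action where the naive form breaks.
u = np.array([0.0, 2.0, 20.0])
naive = naive_log_det_jacobian(u)
stable = stable_log_det_jacobian(u)
```

Both forms agree for moderate pre-actions, but only the stable identity returns the true (very negative) log-density correction at the action boundary instead of a value clamped by the epsilon.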
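The OU exploration noise used in the second TTD3 adjustment is a standard, well-defined process; a minimal sketch (with commonly used but here assumed hyperparameters theta=0.15, sigma=0.2) is:

```python
import numpy as np


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0,1).

    Unlike i.i.d. Gaussian noise, consecutive samples are correlated, which
    produces smoother exploratory action sequences for a deterministic policy.
    """

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (call at episode boundaries)."""
        self.x = self.mu.copy()

    def sample(self):
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.normal(size=self.x.shape))
        self.x = self.x + dx
        return self.x


noise = OrnsteinUhlenbeckNoise(size=3)
samples = np.array([noise.sample() for _ in range(1000)])
```

In TD3-style training the sampled noise is added to the policy's action before it is sent to the environment, with the result clipped back into the valid action range.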
Keywords/Search Tags: Electronic commerce, Virtual Taobao, Product display strategy, Reinforcement learning