
Research On Reinforcement Learning Recommendation Method For Virtual Taobao

Posted on: 2022-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: X F Zhang
Full Text: PDF
GTID: 2518306758491944
Subject: Trade Economy
Abstract/Summary:
Internet shopping has spread to every corner of the world, and recommending product lists to users through multi-dimensional recommendation algorithms deployed on online shopping platforms is of great significance for promoting product sales and meeting users' shopping needs. However, neither traditional recommendation algorithms nor the deep-learning-based methods of recent years can model long-term rewards. In the e-commerce recommendation scenario, the recommendation engine and the system's users constitute each other's environment, and their interaction is a sequence of time-dependent decisions that closely matches the Markov decision process. Using reinforcement learning to maximize the platform's long-term revenue is therefore an important idea for a breakthrough in e-commerce recommendation, and in recent years major e-commerce businesses have begun to apply reinforcement learning to product-list recommendation.

The principle of reinforcement learning is to let the agent learn and explore autonomously, like a human: through self-directed trial and the reward signals fed back by the environment, it continuously optimizes its action policy so that it can select high-reward actions for the different environmental states it observes. An e-commerce trading platform involves the interests of merchants and hundreds of millions of consumers, so the cost of developing and training models in the real online environment is enormous and unbearable. A simulation environment that fits the real engineering environment is therefore the first step toward deploying reinforcement learning in e-commerce. Domestic e-commerce platforms such as Taobao, Jing Dong, and Meituan have invested substantial human and financial resources in reinforcement learning recommendation. At present, the only open-source e-commerce simulation environment available is "Virtual Taobao", jointly developed by Alibaba Group and Nanjing University.

This paper uses "Virtual Taobao" to train three reinforcement learning algorithms with superior current performance, PPO, SAC, and TD3, and achieves better learning results than the DDPG used by the original authors. Combining each method's principle with its experimental behavior, this paper adjusts PPO, SAC, and TD3 to varying degrees to further improve their learning results.

PPO's sampling efficiency is very poor, giving it the lowest learning result among all the methods. This paper uses four sub-threads to interact with the environment and collect training trajectories for the main thread to learn from. In addition, strict comparison experiments were conducted on important parameters such as the number of times each sample is reused.

SAC's policy network may generate many boundary actions, which increases calculation errors during training. This paper uses a new formula to calculate the policy entropy and reduce these errors, which improves both the learning result and the stability of SAC in "Virtual Taobao".

TD3 is very sensitive to the environment, so this paper designs three important adjustments to it. First, the target policy network and the current policy network each predict a candidate action for the current state of the virtual environment, and the candidate with the larger value as evaluated by the current critic network is selected, aiming to improve the learning efficiency of the policy network. Second, the Ornstein-Uhlenbeck (OU) process is used as the exploration noise to improve the agent's ability to explore Virtual Taobao. Third, prioritized experience replay is adopted to improve sampling efficiency. With these adjustments applied, the learning result of TD3 in "Virtual Taobao" is greatly improved.

For convenience of description and comparison, the PPO, SAC, and TD3 variants trained and adjusted in "Virtual Taobao" are denoted DPPO (Distributed PPO), SAC-E (SAC with a new Entropy formula), and TTD3 (TD3 with three adjustments), respectively. CTR (Click-Through Rate) is used to measure the performance of all methods. The learning results of the adjusted PPO, SAC, and TD3 are significantly improved: because of PPO's poor baseline performance, the average CTR of DPPO is much higher than that of PPO; the average CTR of SAC-E is about 15 percentage points higher than that of SAC; and the average CTR of TTD3 is about 9 percentage points higher than that of TD3. This paper thus provides a reference model for the online tuning of reinforcement learning in e-commerce recommendation scenarios.
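The distributed-sampling idea behind DPPO can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: all names are hypothetical, the "environment" and "policy" are random placeholders, and four worker threads simply push rollouts onto a queue that the main (learner) thread drains.

```python
import queue
import threading

import numpy as np


def rollout(seed, horizon=16):
    """Stand-in for one on-policy rollout: a list of (state, action, reward)."""
    rng = np.random.default_rng(seed)
    state = rng.normal(size=4)
    traj = []
    for _ in range(horizon):
        action = rng.normal(size=2)       # placeholder policy
        reward = float(rng.random())      # placeholder environment reward
        traj.append((state, action, reward))
        state = state + 0.1 * rng.normal(size=4)  # placeholder dynamics
    return traj


def worker(worker_id, out_q, n_rollouts=3):
    """Each sub-thread collects trajectories in its own environment copy."""
    for i in range(n_rollouts):
        out_q.put(rollout(worker_id * 100 + i))


q = queue.Queue()
threads = [threading.Thread(target=worker, args=(w, q)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The main thread gathers all transitions into one training batch.
batch = []
while not q.empty():
    batch.extend(q.get())
```

With 4 workers, 3 rollouts each, and a horizon of 16, the learner receives 192 transitions per collection round; in real DPPO the main thread would then run the PPO update on this batch.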
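The boundary-action problem addressed by SAC-E arises because SAC squashes Gaussian pre-actions through tanh, and the log-density correction log(1 - tanh(u)^2) underflows for near-boundary actions. The thesis's exact entropy formula is not given in the abstract; the sketch below shows one standard numerically stable identity for this term, as an illustration of the kind of fix involved.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def naive_log_det_jacobian(u, eps=1e-6):
    """Direct formula log(1 - tanh(u)^2): saturates to log(eps) for large |u|."""
    return np.log(1.0 - np.tanh(u) ** 2 + eps)


def stable_log_det_jacobian(u):
    """Identity: log(1 - tanh(u)^2) = 2*(log 2 - u - softplus(-2u))."""
    return 2.0 * (np.log(2.0) - u - softplus(-2.0 * u))


# u = 20.0 is a near-boundary pre-squash action where the naive form breaks.
u = np.array([0.0, 2.0, 20.0])
naive = naive_log_det_jacobian(u)
stable = stable_log_det_jacobian(u)
```

Both forms agree for moderate pre-actions, but only the stable identity returns the true (very negative) log-density correction at the action boundary instead of a value clamped by the epsilon.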
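The OU exploration noise used in the second TTD3 adjustment is a standard, well-defined process; a minimal sketch (with commonly used but here assumed hyperparameters theta=0.15, sigma=0.2) is:

```python
import numpy as np


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0,1).

    Unlike i.i.d. Gaussian noise, consecutive samples are correlated, which
    produces smoother exploratory action sequences for a deterministic policy.
    """

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (call at episode boundaries)."""
        self.x = self.mu.copy()

    def sample(self):
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.normal(size=self.x.shape))
        self.x = self.x + dx
        return self.x


noise = OrnsteinUhlenbeckNoise(size=3)
samples = np.array([noise.sample() for _ in range(1000)])
```

In TD3-style training the sampled noise is added to the policy's action before it is sent to the environment, with the result clipped back into the valid action range.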
Keywords/Search Tags: Electronic commerce, Virtual Taobao, Product display strategy, Reinforcement learning