
Research And Application Of Large-scale Multi-armed Bandit Algorithm

Posted on: 2019-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhou
Full Text: PDF
GTID: 2428330545451208
Subject: Software engineering
Abstract/Summary:
Reinforcement learning is an important branch of machine learning, and the Multi-Armed Bandit (MAB) problem is the canonical formulation of its exploration-exploitation dilemma. MAB models are widely used in medical trials, recommender systems, smart grids, crowdsourcing, and other domains. However, as the range of applications grows, existing MAB models and algorithms fail to make full use of feedback information, which leads to poor generalization and heavy dependence on contextual information. To address these problems, this thesis proposes several improved MAB models and algorithms.

i. To address the insufficient use of feedback and poor generalization of existing MAB methods, this thesis presents an adaptive algorithm that balances exploration and exploitation based on the Chosen Number of Arm with Minimal Estimation (CNAME for short), which uses both the number of times an action has been chosen and its estimated value. An action is chosen according to an exploration probability that is updated adaptively. To further improve generalization across different environments, a parameter w adjusts how strongly feedback influences the selection process by controlling the decay rate of the exploration probability. The performance of CNAME is comparable to that of contextual MAB algorithms.

ii. To make CNAME effective in large-scale environments, two large-scale bandit approaches are proposed for the setting in which no prior information is available. The proposed synchronous and asynchronous MAB algorithms do not rely on the contextual information of large-scale recommender systems; both aim to maximize expected return, measured as click-through rate (CTR). During exploration, an action is chosen at random rather than taking the least-chosen action: when there are many actions, random selection makes exploration more effective and can surprise the user. In addition, updating the exploration probability asynchronously improves the algorithm's efficiency in large-scale environments; the asynchronous update weakens the influence of users' short-term choices on recommendations, which further improves user clicks.

iii. Many studies have shown that context helps maximize expected return. For the case where contextual information is available, a large-scale contextual MAB algorithm, Con-CNAME, is proposed. Con-CNAME first computes the prior probability of each context from the feature matrix; it then combines these priors with estimated action values, using contexts and user feedback simultaneously, which further improves performance in large-scale contextual environments. The thesis also proposes an action-estimation method based on the selection probability of each action.
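The abstract does not give pseudocode for CNAME, but its core idea — exploit the best estimate, explore the least-chosen arm, and let a parameter w control how fast the exploration probability decays — can be sketched as follows. The decay schedule `1 / (1 + w * t)` and all identifiers here are illustrative assumptions, not the thesis's exact formulation:

```python
import random

class CNAMEBandit:
    """Sketch of a CNAME-style bandit (assumed details, see lead-in).

    Exploration probability decays with the number of plays t at a rate
    controlled by w; exploration targets the least-chosen arm, while
    exploitation takes the arm with the highest estimated value.
    """

    def __init__(self, n_arms, w=0.5):
        self.counts = [0] * n_arms    # times each arm has been chosen
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.w = w                    # decay-rate parameter (assumption)
        self.t = 0                    # total number of plays so far

    def exploration_prob(self):
        # Assumed schedule: epsilon shrinks as plays accumulate,
        # and faster for larger w.
        return 1.0 / (1.0 + self.w * self.t)

    def select(self):
        self.t += 1
        if random.random() < self.exploration_prob():
            # Explore: the arm with the minimal chosen count.
            return min(range(len(self.counts)), key=lambda a: self.counts[a])
        # Exploit: the arm with the highest estimated value.
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        # Incremental update of the running mean reward.
        self.values[arm] += (reward - self.values[arm]) / n
```

The large-scale variants described in part ii would replace the `min(...)` line with a uniformly random choice and defer the `exploration_prob` update to an asynchronous step, leaving the rest of the loop unchanged.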
Keywords/Search Tags: reinforcement learning, multi-armed bandit, large scale, context-aware, recommender system