
Research on Linear Bandits with Heavy-Tailed Rewards

Posted on: 2022-03-10    Degree: Master    Type: Thesis
Country: China    Candidate: B Xue    Full Text: PDF
GTID: 2518306725481444    Subject: Computer technology
Abstract/Summary:
With the rapid development of information technology, the data generated by mankind has grown rapidly and the era of big data has arrived. Online learning assumes that training data arrive continuously and usually uses a single training sample to update the current model, which significantly reduces the space and time complexity of the learning algorithm and offers strong real-time performance. In the era of big data, online learning has therefore become an important technique for coping with the large scale and fast growth of data. As a branch of online learning, the bandit model can be updated with only limited feedback, which gives it an attractive combination of computational efficiency and theoretical guarantees. The linear bandit model represents each arm by a context vector so as to fully exploit the contextual information. Most existing work assumes that the rewards are bounded or sub-Gaussian, an assumption that may be violated in scenarios such as financial markets and web advertising. When the rewards follow a heavy-tailed distribution, the learner receives extreme rewards with non-negligible probability, and these extreme values seriously affect the learner's decision-making process. The main contributions are as follows.

First, for stochastic linear bandits with finitely many arms, we propose two novel algorithms based on the median of means and on truncation, called SupBMM and SupBTC respectively, which enjoy a sublinear regret bound of $\tilde{O}(d^{1/2}T^{1/(1+\epsilon)})$, where $d$ is the dimension of the contextual information, $T$ is the time horizon, and $\epsilon \in (0,1]$ characterizes the finite $(1+\epsilon)$-th moment of the heavy-tailed rewards. For linear bandits with heavy-tailed rewards, the core question is how to combine the heavy-tailed strategies (median of means and truncation) with linear bandit techniques so as to enhance the robustness of the bandit model. To adopt the median of means in bandit learning, we need to play the chosen arm several times and then use the median of these rewards as the training data to update the model. For the truncation strategy, we only need to play the selected arm once in each round; the observed reward is truncated according to the truncation criterion, and the truncated reward is then used to update the model. Because SupBMM plays the chosen arm several times in each round, its model is updated fewer times than that of SupBTC, resulting in a slightly weaker regret bound, whereas SupBTC needs to truncate all the historical rewards in each round and therefore has a higher computational complexity. Meanwhile, we provide an $\Omega(d^{\epsilon/(1+\epsilon)}T^{1/(1+\epsilon)})$ lower bound, which implies that our upper bound matches the lower bound up to polylogarithmic factors in the order of $d$ and $T$ when $\epsilon = 1$.
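To make the two robustification strategies concrete, the following is a minimal Python sketch, not the SupBMM/SupBTC algorithms themselves: the truncation threshold, the number of plays per round, and the toy heavy-tailed reward model are illustrative assumptions only. It shows how a median-of-means or truncated reward can feed the least squares estimator on which stochastic linear bandits rely.

```python
import numpy as np

def median_of_means_reward(pulls):
    """Median-of-means strategy: the chosen arm is played several times and
    the median of the observed rewards is kept as the training signal."""
    return float(np.median(pulls))

def truncate_reward(reward, threshold):
    """Truncation strategy: the arm is played once and the reward is zeroed
    out when its magnitude exceeds the truncation criterion `threshold`."""
    return reward if abs(reward) <= threshold else 0.0

def least_squares_update(V, b, x, y):
    """Regularised least squares estimator used by stochastic linear bandits:
    V accumulates x x^T, b accumulates y x, theta_hat = V^{-1} b."""
    V = V + np.outer(x, x)
    b = b + y * x
    return V, b, np.linalg.solve(V, b)

# Toy round with assumed dimension d and a heavy-tailed (Student-t) noise model.
d = 5
V, b = np.eye(d), np.zeros(d)            # identity regularisation keeps V invertible
x = np.random.randn(d)                   # context of the chosen arm
raw_rewards = x @ np.ones(d) + np.random.standard_t(df=2, size=3)  # 3 plays of the arm
y = median_of_means_reward(raw_rewards)  # or: truncate_reward(raw_rewards[0], threshold=10.0)
V, b, theta_hat = least_squares_update(V, b, x, y)
```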
Second, in order to further broaden the applicability of the bandit model, we study generalized linear bandits with heavy-tailed rewards. The main difference between the generalized linear bandit model and the stochastic linear bandit model lies in the way the parameters are updated: the generalized linear bandit adopts the online Newton method, while the stochastic linear bandit relies on the least squares estimator. We propose two novel algorithms based on the median of means and on truncation for generalized linear bandits, called CRMM and CRTM respectively, which enjoy a sublinear regret bound of $\tilde{O}(dT^{1/(1+\epsilon)})$. It is worth pointing out that CRMM reduces the number of estimators required by the median-of-means-based algorithm from $O(\log T)$ to $1$, and CRTM avoids the shortcoming that truncation-based algorithms cannot realize online learning. Therefore, our algorithms are much more efficient than the existing algorithms for heavy-tailed bandits. CRMM and CRTM are almost optimal, since the lower bound for the heavy-tailed stochastic linear bandit problem is $\Omega(dT^{1/(1+\epsilon)})$.
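To illustrate how the generalized linear setting replaces the least squares estimator with a single Newton-type step per round, here is a minimal sketch of an online Newton-style update. The logistic link, the form of the per-round gradient, and the absence of a projection step are simplifying assumptions for illustration; this is not the exact CRMM/CRTM update.

```python
import numpy as np

def sigmoid(z):
    """Logistic link function, chosen here purely for illustration."""
    return 1.0 / (1.0 + np.exp(-z))

def online_newton_step(theta, A, x, y):
    """One round of an online Newton-style update: x is the chosen arm's
    context and y the robustified (median-of-means or truncated) reward.
    A accumulates second-order information; theta moves along A^{-1} g."""
    g = (sigmoid(x @ theta) - y) * x       # gradient of the per-round loss
    A = A + np.outer(x, x)                 # Hessian-like accumulation
    theta = theta - np.linalg.solve(A, g)  # single Newton-type correction
    return theta, A

# Toy usage with an assumed dimension.
d = 5
theta, A = np.zeros(d), np.eye(d)          # identity keeps A invertible
x = np.random.randn(d)                     # context of the chosen arm
y = 0.7                                    # robustified reward for this round
theta, A = online_newton_step(theta, A, x, y)
```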
Keywords/Search Tags: bandits, heavy-tailed, median of means, truncation, regret