The two-armed bandit (TAB) problem, originally described by Thompson [55] in 1933 in the context of clinical trials, is a form of the sequential design problem and is usually portrayed as a slot machine with two arms, the X-arm and the Y-arm. A decision-maker pulls one of the two arms at a time and receives a random return drawn from that arm's distribution. The decision-maker's goal is to design a sequential decision rule that maximizes the expected total return, taking into account all previous choices and outcomes. This classic problem concisely models the trade-off in reinforcement learning between exploration (trying each arm several times to gain further information) and exploitation (pulling the arm currently believed to maximize reward). Motivated by the trade-offs of this kind faced in the real world, such as learning and optimization problems in business and industry, clinical trials, and engineering, the TAB problem has received much attention since it was proposed. For the importance and applicability of this problem in ad placement, source routing, computer game-playing, recommendation services, waiting problems, and resource allocation, see [8] and [37].

As with other problems involving unknowns, there are two major approaches to formulating TABs. The first is the frequentist point of view, in which the success probabilities of the two arms are deterministic but unknown parameters. Under this approach, since the effectiveness of a strategy depends on the unknown parameters, an obvious challenge is how to compare strategies, given that no strategy dominates all others for every parameter value. The second is the Bayesian point of view, in which the unknown success probabilities of the two arms are random variables with given prior distributions. As the trial progresses, these distributions are continuously updated. Bayes' theorem provides a convenient mathematical formalism for adaptive learning and is an ideal tool for sequential decision problems. However, the randomness of the posterior probabilities makes specific problems in this framework extremely challenging. In this dissertation, we take these two approaches separately, study problems such as optimal and asymptotically optimal strategies, and obtain limit theorems such as laws of large numbers and central limit theorems.

Meanwhile, in classical probability theory, the additivity of probability measures plays a crucial role. This additivity assumption has been abandoned in many areas, however, because myriad uncertain phenomena cannot be well modeled by additive probabilities or linear expectations. Recently, motivated by problems of model uncertainty in statistics, risk measures, and super-hedging pricing in finance, Peng [43] introduced the framework of sublinear expectation, which provides very effective methods for handling such problems. Peng also initiated the notion of independently and identically distributed random variables in this framework and developed a new weak law of large numbers. Later, [18] obtained a new strong law of large numbers; more related results can be found in [13], [17], [30], [31], [32], and [42]. Note that most of these results rest on an independence assumption. In many situations, however, independence does not hold, and dependence is an intrinsic feature. In classical probability theory, dependence is often characterized by a series of dependence coefficients, for example the strong-mixing, ρ-mixing, and φ-mixing conditions (see [49], [35], and [5], respectively). By contrast, in the framework of sublinear expectation it is difficult to define the corresponding conditions, which raises a question: is there a proper way to characterize dependence? When dealing with practical problems in finance, linear processes are commonly used to describe financial risks (see Peng et al. [44]), so we naturally think of using them to characterize
dependence. This is also an advantageous approach in the classical case (see Wu [57]). In this dissertation, we investigate the limit behavior of linear processes in the framework of sublinear expectation and derive laws of large numbers for them.

This dissertation has three chapters and is organized as follows. In Chapter 1, we study TAB problems in the frequentist framework. We first present an asymptotically optimal strategy under which the average return converges to the larger of the two expectations. Next, we provide a generalized asymptotically optimal strategy such that the average return converges to any prescribed point between the maximum and minimum expectations of the two arms. Simulation studies provide supportive evidence that the newly proposed strategies perform well. We then generalize the central limit theorem of Chen et al. [15], whose limit is the Bandit distribution, and apply it to a similarity test, which we find to be more powerful than the traditional method based on the normal distribution; the simulation results validate this conclusion.

In Chapter 2, we study a type of Bayesian TAB problem in which a prior probability on which arm yields the larger expected return is given and, as the trial progresses, the corresponding posterior probability is continuously updated. We first consider the existence of an arched reward function, since such a function plays a key role in calculating second-order moments, variances, probabilities of intervals, and so on. We propose a myopic strategy and, using the dynamic programming principle, show that it is optimal under the structure of the arched reward function. We then study the properties of the posterior probabilities under different strategies: they are proved to be martingales that tend to 0 or 1 almost surely as the number of trials increases, and they share the same distribution. Based on these properties, we introduce two further myopic strategies and obtain the corresponding laws of large numbers and central limit theorems.

In Chapter 3, we investigate the limit behavior of linear processes in the framework of sublinear expectation and derive several laws of large numbers for them. Our theorems turn out to be natural extensions of their counterparts in the classical linear case, and the corresponding laws of large numbers for independent random variables under sublinear expectation follow from our results.
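To make the exploration–exploitation trade-off concrete, the following is a minimal simulation sketch of a two-armed Bernoulli bandit. It is not any of the strategies constructed in this dissertation; the rule shown (forced exploration on square-numbered rounds, exploitation of the empirically better arm otherwise) and the success probabilities are illustrative assumptions, chosen only because a vanishing exploration rate makes the average return converge to the larger expectation.

```python
import random

def two_armed_bandit(p_x, p_y, n, seed=0):
    """Simulate n pulls of a two-armed Bernoulli bandit.

    Illustrative rule (not the dissertation's strategy): on rounds
    whose index is a perfect square, pull a forced arm (alternating
    X and Y) so that both arms are sampled infinitely often; on all
    other rounds, pull the arm with the higher empirical mean.
    Since the forced rounds have vanishing frequency (~ sqrt(n)/n),
    the average return tends to max(p_x, p_y).
    """
    rng = random.Random(seed)
    counts = [0, 0]    # number of pulls of arm X and arm Y
    sums = [0.0, 0.0]  # total observed return of each arm
    total = 0.0
    for t in range(1, n + 1):
        root = int(t ** 0.5)
        if root * root == t:
            arm = root % 2  # forced exploration, alternating arms
        else:
            # exploitation: arm with the larger empirical mean
            means = [sums[i] / counts[i] if counts[i] else 0.0
                     for i in range(2)]
            arm = 0 if means[0] >= means[1] else 1
        reward = 1.0 if rng.random() < (p_x, p_y)[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total / n

# With p_x = 0.3 and p_y = 0.7, the average return approaches 0.7
# as n grows, matching the first-order behavior described above.
```

The same skeleton also illustrates the frequentist comparison problem: for fixed n, no single rule of this kind is best for every pair (p_x, p_y), which is exactly the difficulty in ranking strategies noted earlier.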