Font Size: a A A

Research On Theory And Application Of Thompson Sampling In Sequential Decision-Making Problems

Posted on:2020-05-09Degree:DoctorType:Dissertation
Country:ChinaCandidate:Z Y ZhuFull Text:PDF
GTID:1360330578981651Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Many real-world applications can be modelled as sequential decision-making prob-lem.In a sequential decision-making problem,the learning agent interacts with an un-known and noisy environment to maximize the cumulative reward.Thompson sam-pling is one of the most promising strategies to tackle this problem.It utilizes Bayesian strategy to balance the exploration-exploitation tradeoff.Despite being empirically ef-ficient and powerful,Thompson sampling has eluded theoretical understanding.How the information is shared across different strategies is not fully understood and why Thompson Sampling is effective to balance the exploration-exploitation tradeoff is not thoroughly studied.Moreover,in many practical domains,the stochastic and station-ary assumptions about the environment are often violated.These drawbacks hinder the practical deployment of Thompson sampling in highly dynamic and large-scale do-mains.In this thesis,we extend current researches and focus on the theoretical guaran-tee of Thompson sampling and its practical deployment.The details are illustrated as follows:1.By introducing a martingale-based analysis technique,we demonstrate that the frequentist regret upper bound of linear Thompson sampling matches its Bayesian regret bound and the information-theoretic lower bound for this problem.Our analysis precisely qualifies the exploration-exploitation tradeoff and partially re-veals how the information is transferred between different actions in linear set-tings.We also conduct empirical experiments to demonstrate the assumptions and the lemmas.2.By decomposing the instantaneous regret and utilizing matrix martingale theory,we prove that the expected regret upper bound of linear combinatorial cascading Thompson sampling matches the regret bounds for state-of-the-art UCB-based algorithms.Our analysis reveals that linear combinatorial cascading Thompson sampling explores every dimension of feature space and exploits current informa-tion efficiently.We also experiment on two real-world datasets to demonstrate the advantage of our algorithm over existing works.3.We incorporate collaborative effects into Thompson sampling so that the algo-rithm can capture real-time changes of the environment and adjust decision-making strategy accordingly.Our algorithm shows accelerated convergence and improved prediction performance over standard Thompson sampling in collaborative envi-ronments.We also provide regret analyses in both contextual and non-contextual settings.Our theoretical analyses prove that collaborative Thompson sampling outperforms standard Thompson sampling in the collaborative environments.4.We design and analyze collaborative contextual combinatorial cascading Thomp-son sampling to address the sequential decision-making problem in a collabora-tive environment.By analyzing the change of variance of the reward estimator,we prove that the algorithm can efficiently explore the unknown cluster structure in the environment.The experiments conducted on synthetic dataset demonstrate the advantage of incorporating online clustering.
Keywords/Search Tags:Sequential Decision Making, Exploration-Exploitation Tradeoff, Thompson Sampling, Stochastic Model, Contextual Model
PDF Full Text Request
Related items