Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp014j03d233b
Full metadata record
DC Field: Value
dc.contributor.advisor: Bubeck, Sebastien
dc.contributor.author: LIU, CHE-YU
dc.contributor.other: Operations Research and Financial Engineering Department
dc.date.accessioned: 2018-02-05T16:49:04Z
dc.date.available: 2018-02-05T16:49:04Z
dc.date.issued: 2018
dc.identifier.uri: http://arks.princeton.edu/ark:/88435/dsp014j03d233b
dc.description.abstract: Bandit problems are the most basic examples of sequential decision-making problems with limited feedback and an exploration/exploitation trade-off. In these problems, an agent repeatedly selects an action from a pool of candidates and receives a reward sampled from that action's reward distribution. The agent's goal is to maximize the sum of rewards it receives over time. The trade-off at each time step is between exploitation of the actions that have already produced high rewards in the past and exploration of poorly understood actions that have the potential to yield even higher rewards in the future. Bandit problems arise naturally in many applications, such as clinical trials, project management, and online news recommendation. Thompson Sampling is a popular strategy for solving bandit problems. It selects actions using the "probability matching" principle. In the first part of this thesis, we analyze Thompson Sampling from several different angles. First, we prove a tight bound on Thompson Sampling's performance when that performance is averaged with respect to the prior distribution that Thompson Sampling uses as input. Next, we look at the more realistic non-averaged performance of Thompson Sampling. We quantify the sensitivity of Thompson Sampling's (non-averaged) performance to the choice of input prior by providing matching upper and lower bounds. Finally, we illustrate Thompson Sampling's ability to optimally exploit prior knowledge by thoroughly analyzing its behavior in a non-trivial example. In the second part of this thesis, we switch our focus to the most-correlated-arms identification problem, where the actions' reward distributions are assumed to be jointly Gaussian and the goal is to find the actions whose rewards are most mutually correlated. In this problem, unlike in bandit problems, we focus on exploring the actions to acquire as much relevant information as possible, and we only exploit the acquired information at the end to return the set of correlated actions. We propose two adaptive action-selection strategies and show that they can have significant advantages over the non-adaptive uniform sampling strategy. Our proposed algorithms rely on a novel correlation estimator. The use of this accurate estimator allows us to obtain improved results for a wide range of problem instances.
dc.language.iso: en
dc.publisher: Princeton, NJ : Princeton University
dc.relation.isformatof: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu
dc.subject: Bandit Problems
dc.subject: Bayesian Algorithms
dc.subject: Bounded Regrets
dc.subject: Exploration-Exploitation Tradeoff
dc.subject: Pure Exploration
dc.subject: Thompson Sampling
dc.subject.classification: Artificial intelligence
dc.subject.classification: Computer science
dc.subject.classification: Statistics
dc.title: Thompson Sampling for Bandit Problems
dc.type: Academic dissertations (Ph.D.)
pu.projectgrantnumber: 690-2143
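
The abstract above describes Thompson Sampling's "probability matching" principle: at each step, draw a plausible mean reward for every action from its posterior distribution and play the action whose draw is largest, so each action is selected with the posterior probability that it is the best one. The following Python sketch is a minimal illustration of that rule for Bernoulli bandits with independent Beta(1, 1) priors; it is not taken from the thesis, and the function name, arm means, and horizon are hypothetical.

import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    # Illustrative sketch: Thompson Sampling for Bernoulli bandit arms
    # with independent Beta(1, 1) priors on each arm's mean reward.
    rng = np.random.default_rng(seed)
    k = len(true_means)
    successes = np.zeros(k)  # number of reward-1 observations per arm
    failures = np.zeros(k)   # number of reward-0 observations per arm
    total_reward = 0
    for _ in range(horizon):
        # Probability matching: sample one mean per arm from its posterior
        # Beta(1 + successes, 1 + failures) and play the arm whose sampled
        # mean is largest.
        sampled_means = rng.beta(1.0 + successes, 1.0 + failures)
        arm = int(np.argmax(sampled_means))
        reward = rng.binomial(1, true_means[arm])  # simulated environment
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Hypothetical example: three arms with unknown means 0.3, 0.5, and 0.7;
# over time the strategy concentrates its plays on the best arm.
print(thompson_sampling_bernoulli([0.3, 0.5, 0.7], horizon=10000))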
Appears in Collections: Operations Research and Financial Engineering

Files in This Item:
File: LIU_princeton_0181D_12387.pdf
Size: 868.77 kB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.