TH1.R3.2

Multi-armed Bandits with Probing

Eray Can Elumar, Carnegie Mellon University, United States; Cem Tekin, Bilkent University, Turkey; Osman Yagan, Carnegie Mellon University, United States

Session:
Multi-Armed Bandits 1

Track:
8: Bandits

Location:
Ypsilon IV-V-VI

Presentation Time:
Thu, 11 Jul, 10:05 - 10:25

Session Chair:
Ali Tajer, Rensselaer Polytechnic Institute

Abstract
We examine a multi-armed bandit problem with probing, where at each round the agent may pay a cost $c \geq 0$ to probe one arm and observe its reward before deciding which arm to pull. We identify the optimal strategy for deciding whether to probe or to pull an arm directly and, when an arm is probed, which arm to pull after observing the probe's outcome. Additionally, we introduce a novel regret definition based on the expected reward of the optimal action. We propose UCBP, a novel algorithm that employs this strategy and achieves a gap-independent regret upper bound of $\mathcal{O}(\sqrt{KT \log T})$ and an order-optimal gap-dependent upper bound of $\mathcal{O}(K \log T)$. As a baseline, we provide UCB-naive-probe, a naive UCB-based approach with a gap-independent regret upper bound of $\mathcal{O}(\sqrt{KT \log T})$ and a gap-dependent regret bound of $\mathcal{O}(K^2 \log T)$. We present empirical simulations verifying the utility of UCBP in practical settings and showing that it outperforms UCB-naive-probe.
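The probe-or-pull decision described in the abstract can be illustrated with a short simulation. The sketch below runs UCB1-style indices over Bernoulli arms and, each round, compares the index of the top arm against a rough estimate of the value of probing it first (paying $c$, observing the reward, and falling back to the best alternative if the probe disappoints). The index definition, the probe-value estimate, and the fallback rule are assumptions made for illustration, not the paper's exact UCBP algorithm.

```python
import math
import random


def simulate_ucbp(means, T, c, seed=0):
    """Run a UCB-with-probing loop on Bernoulli arms; return total reward.

    Illustrative sketch only: the probe/pull rule below is a plausible
    stand-in, not the UCBP algorithm from the paper.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K    # observations per arm (probes are observations too)
    sums = [0.0] * K    # sum of observed rewards per arm
    total = 0.0

    def draw(i):
        # Bernoulli reward for arm i.
        return 1.0 if rng.random() < means[i] else 0.0

    # Initialization: pull each arm once.
    for i in range(K):
        r = draw(i)
        counts[i] += 1
        sums[i] += r
        total += r

    for t in range(K + 1, T + 1):
        # UCB1-style indices.
        ucb = [sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
               for i in range(K)]
        best = max(range(K), key=lambda i: ucb[i])
        alt = max((i for i in range(K) if i != best), key=lambda i: ucb[i])

        mu_hat = sums[best] / counts[best]
        m = min(ucb[alt], 1.0)  # fallback value, clipped to the reward range
        # For a Bernoulli probe X with mean mu_hat and fallback value m,
        # E[max(X, m)] = mu_hat * 1 + (1 - mu_hat) * m, so probing is worth
        # roughly this expectation minus the probing cost c.
        probe_value = mu_hat + (1.0 - mu_hat) * m - c
        pull_value = min(ucb[best], 1.0)

        if probe_value > pull_value:
            x = draw(best)            # probe: observe this round's reward
            counts[best] += 1
            sums[best] += x
            total -= c                # pay the probing cost
            if x >= m:
                total += x            # keep the probed arm's observed reward
            else:
                r = draw(alt)         # fall back to the best alternative
                counts[alt] += 1
                sums[alt] += r
                total += r
        else:
            r = draw(best)            # pull the top-index arm without probing
            counts[best] += 1
            sums[best] += r
            total += r

    return total


if __name__ == "__main__":
    print(simulate_ucbp([0.3, 0.5, 0.7], T=10_000, c=0.05))
```

Lowering $c$ in this sketch makes probing more frequent, matching the intuition that cheap probes let the agent hedge pulls of uncertain arms before committing to them.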