We examine a multi-armed bandit problem with probes, in which the agent may pay a cost $c \geq 0$ to probe an arm and observe its reward before pulling. We identify the optimal strategy for deciding whether to probe or to pull directly and, when an arm is probed, which arm to pull after observing the probe's outcome. We also introduce a novel regret definition based on the expected reward of the optimal action. We propose UCBP, a novel algorithm built on this strategy, which achieves a gap-independent regret upper bound scaling as $\mathcal{O}(\sqrt{K T \log T})$ and an order-optimal gap-dependent upper bound scaling as $\mathcal{O}(K \log T)$. As a baseline, we provide UCB-naive-probe, a naive UCB-based approach with a gap-independent regret upper bound of order $\mathcal{O}(\sqrt{K T \log T})$ and a gap-dependent bound of order $\mathcal{O}(K^2 \log T)$. Empirical simulations verify the utility of UCBP in practical settings and show that it outperforms UCB-naive-probe.
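To make the probe-then-pull interaction concrete, the following is a minimal Python sketch of one plausible round structure. It is not the UCBP policy analyzed here: the probe rule (probe when the leading arm's confidence width exceeds the cost $c$) and the post-probe fallback rule are hypothetical stand-ins used only to illustrate the control flow of probing before committing to a pull.

```python
import numpy as np

# Illustrative sketch of a probe-then-pull bandit loop; the probe and
# fallback rules below are hypothetical, not the paper's UCBP policy.
rng = np.random.default_rng(0)

K = 5                                   # number of arms
T = 10_000                              # horizon
c = 0.05                                # probe cost
mu = rng.uniform(0.2, 0.8, size=K)      # hidden Bernoulli means

counts = np.ones(K)                     # one initial pull per arm
means = rng.binomial(1, mu).astype(float)
total_reward = means.sum()

for t in range(K + 1, T + 1):
    # UCB1-style indices.
    ucb = means + np.sqrt(2.0 * np.log(t) / counts)
    best = int(np.argmax(ucb))

    # Hypothetical probe rule: probe when the leading arm's confidence
    # width exceeds the probe cost, i.e. the observation is "worth" c.
    width = np.sqrt(2.0 * np.log(t) / counts[best])
    if width > c:
        # In this sketch the probe draw is used only for the decision.
        probe_reward = rng.binomial(1, mu[best])
        total_reward -= c
        # Pull the probed arm only if its observed reward beats the best
        # alternative's empirical mean; otherwise fall back to that arm.
        alt = int(np.argmax(np.where(np.arange(K) == best, -np.inf, means)))
        pull = best if probe_reward >= means[alt] else alt
    else:
        pull = best

    r = rng.binomial(1, mu[pull])
    counts[pull] += 1
    means[pull] += (r - means[pull]) / counts[pull]
    total_reward += r
```

At $c = 0$ the sketch probes on every round, while for large $c$ it reduces to plain UCB1, which mirrors the qualitative trade-off the probing model introduces.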