The multi-armed bandit (MAB) problem refers to the task of sequentially assigning treatments to experimental units so as to identify the best treatment(s) while controlling the opportunity cost of further investigation. Many algorithms have been developed that attempt to balance this trade-off between exploiting the seemingly optimum treatment and exploring the other treatments. The selection of an MAB algorithm for implementation in a particular context is often performed by comparing candidate algorithms in terms of their abilities to control the expected regret of exploration versus exploitation. This singular criterion of mean regret is insufficient for many practical problems, and therefore an additional criterion that should be considered is control of the variance, or risk, of regret.
This work provides an overview of how the existing prominent MAB algorithms handle both criteria. We additionally investigate the effects of incorporating prior information into an algorithm's model, including how sharing information across treatments affects the mean and variance of regret.
A unified and accessible framework does not currently exist for constructing MAB algorithms that control both of these criteria. To this end, we develop such a framework based on the two elementary concepts of dismemberment of treatments and a designed learning phase prior to dismemberment. These concepts can be incorporated into existing MAB algorithms to effectively yield new algorithms that better control the expectation and variance of regret. We demonstrate the utility of our framework by constructing new variants of the Thompson sampler that involve a small number of simple tuning parameters. As we illustrate in simulation and case studies, these new algorithms are implemented in a straightforward manner and achieve improved control of both regret criteria compared to the traditional Thompson sampler. Ultimately, our consideration of additional criteria besides expected regret illuminates novel insights into the multi-armed bandit problem.
Finally, we present visualization methods, and a corresponding R Shiny app for their practical execution, that can yield insights into the comparative performances of popular MAB algorithms. Our visualizations illuminate the frequentist dynamics of these algorithms in terms of how they perform the exploration-exploitation trade-off over their populations of realizations as well as the algorithms' relative regret behaviors. The constructions of our visualizations facilitate a straightforward understanding of complicated MAB algorithms, so that our visualizations and app can serve as unique and interesting pedagogical tools for students and instructors of experimental design.