This Python package contains implementations of methods from different papers dealing with the contextual bandits problem, as well as adaptations from typical multi-armed bandits strategies. It aims to provide an easy way to prototype and compare ideas, to reproduce research papers that don't provide easily-available implementations of their proposed algorithms, and to serve as a guide in learning about contextual bandits.
Package is available on PyPI, can be installed with
pip install contextualbandits
Contextual bandits, also known as multi-armed bandits with covariates or associative reinforcement learning, is a problem similar to multi-armed bandits, but with the difference that side information or covariates are available at each iteration and can be used to select an arm, whose rewards are also dependent on the covariates.
The problem comes from an iterative process generating data as follows:
- At each round, the world creates an observation consisting of a set of covariates
X
of fixed dimension and a vector of rewardsr
(which are stochastic but dependent on the covariates) of length corresponding tom
, the number of arms. - An agent must choose an arm or label for the observation among the set of
m
arms. - The world reveals the reward for the arm chosen by the agent, but not for the other arms.
The aim is of course to build a policy that would maximize the rewards obtained by the agent. The arms might also expire over time and new arms might appear too, leading to the same exploration-exploitation dilemma faced in multi-armed bandits.
The problem is very similar to multi-class or multi-label classification (with the reward being whether the right label was chosen or not), but with the big difference that the right label or set of labels is not known for each observation, only whether the label that was chosen by the agent for each observation was correct or not.
Examples of such scenarios include online advertising, where we only know whether a user clicked an ad that he was presented with, but don't know which other ads he would have clicked; or clinic trials where we know how a person responded to a treatment, but don't know how he would have responded to a different treatment.
While, in general, algorithms for the contextual bandits problem assume continuous rewards in the range [0,1]
, this package deals only with the case of discrete rewards {0,1}
, and only with the case of arms that all share the same features.
Three of the main problematics that arise in contextual bandits are:
- Building a policy or strategy for making choices in an online setting that would manage exploration of new arms under different features, or exploitation of arms that are known to be good for some features – covered in
contextualbandits.online
. - Building a policy/algorithm with data collected from a different policy, covered in
contextualbandits.offpolicy
. - Evaluating the performance of different strategies/policies/algorithms based on partially-labeled data, covered in
contextualbandits.evaluation
.
For more information, see the user guides below.
Most of the algorithms here are meta-heuristics that take a binary classifier as a base, such as Logistic Regression.
You can find detailed usage examples with public datasets in the following IPython notebooks:
- Online Contextual Bandits
- Off-policy Learning in Contextual Bandits
- Policy Evaluation in Contextual Bandits
Package documentation is available in readthedocs: http://contextual-bandits.readthedocs.io/en/latest/
Documentation is also internally available through docstrings (e.g. you can try help(contextualbandits.online.BootstrappedUCB)
, help(contextualbandits.online.BootstrappedUCB.fit)
, etc.).
Implementations in this package include:
Online:
- LinUCB (see [1])
Adaptations from multi-armed bandits strategies:
- Upper-confidence Bound (see [5] and [2])
- Thompson Sampling (see [2] and [3])
- Epsilon Greedy (see [6] and [5])
- Adaptive Greedy (see [4])
- Explore-Then-Exploit (see [6])
Other:
- Exploration based on active learning
Off-policy:
- Offset Tree (see [7])
- Doubly-Robust Policy Optimization (see [9])
Evaluation:
- Rejection Sampling (see [1])
- Doubly-Robust Policy Evaluation (see [8])
Some of the policies here such as AdaptiveGreedy
haven't been yet evaluated or analyzed in any research papers for the scenario of contextual bandits though, and there might be better ways to convert the corresponding multi-armed bandits strategies to contextual bandits strategies than how it was done here.
Note that the algorithms are meant for an "online" scenario in the sense of needing to make decisions in real time and receive feedback shortly afterwards, but most of them still require to be refit to all historical data every time rather than being "online" in the sense of being fit to batches of data as they come. Implementations of truly online algorithms (other than LinUCB) to follow in the future.
Many of the algorithms here oftentimes don't manage to beat simpler benchmarks (e.g. Offset Tree vs. a naïve One-Vs-Rest using only subsets of the data for each classifier), and I wouldn't recommend relying on them. They are nevertheless provided for comparison purposes.
[1] Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010, April). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web (pp. 661-670). ACM.
[2] Chapelle, O., & Li, L. (2011). An empirical evaluation of thompson sampling. In Advances in neural information processing systems (pp. 2249-2257).
[3] Agrawal, S., & Goyal, N. (2013, February). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning (pp. 127-135).
[4] Chakrabarti, D., Kumar, R., Radlinski, F., & Upfal, E. (2009). Mortal multi-armed bandits. In Advances in neural information processing systems (pp. 273-280).
[5] Vermorel, J., & Mohri, M. (2005, October). Multi-armed bandit algorithms and empirical evaluation. In European conference on machine learning (pp. 437-448). Springer, Berlin, Heidelberg.
[6] Yue, Y., Broder, J., Kleinberg, R., & Joachims, T. (2012). The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5), 1538-1556.
[7] Beygelzimer, A., & Langford, J. (2009, June). The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 129-138). ACM.
[8] Dudík, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601.
[9] Dudík, M., Erhan, D., Langford, J., & Li, L. (2014). Doubly robust policy evaluation and optimization. Statistical Science, 485-511.