Multi-armed bandit with budget constraint and variable costs
Proceedings of the AAAI Conference on Artificial Intelligence, 2013 (ojs.aaai.org)
Abstract
We study the multi-armed bandit problem with budget constraint and variable costs (MAB-BV). In this setting, pulling an arm yields a random reward together with a random cost, and the objective of an algorithm is to pull a sequence of arms that maximizes the expected total reward while the total cost of those pulls complies with a budget constraint. This new setting models many Internet applications (e.g., ad exchange, sponsored search, and cloud computing) more accurately than previous settings, in which pulling an arm is either costless or has a fixed cost. We propose two UCB-based algorithms for the new setting. The first algorithm needs prior knowledge of a lower bound on the expected costs when computing the exploration term. The second algorithm eliminates this need by estimating the minimal expected cost from empirical observations, and can therefore be applied to more real-world applications where prior knowledge is not available. We prove that both algorithms learn effectively, with regret bounds of O(ln B) in the budget B. Furthermore, we show that applying our proposed algorithms to a previous setting with fixed costs (which can be regarded as a special case of ours) improves the previously obtained regret bound. Our simulation results on real-time bidding in ad exchange verify the effectiveness of the algorithms and are consistent with our theoretical analysis.
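To make the setting concrete, the following is a minimal sketch of a budget-limited, UCB-style policy in the spirit of the first algorithm, which assumes a known lower bound on the expected costs. The abstract does not state the paper's index formula, so the index used here (an optimistic reward estimate divided by a pessimistic cost estimate, floored at the known bound) is an illustrative assumption, not the authors' exploration term. The names `budgeted_ucb`, `make_arm`, and `min_cost` are hypothetical, and rewards and costs are assumed to lie in [0, 1] with strictly positive costs.

```python
import math
import random


def make_arm(mean_reward, cost_low, cost_high):
    """Hypothetical arm: Bernoulli reward, uniform cost in [cost_low, cost_high]."""
    def pull(rng):
        reward = 1.0 if rng.random() < mean_reward else 0.0
        cost = rng.uniform(cost_low, cost_high)
        return reward, cost
    return pull


def budgeted_ucb(arms, budget, min_cost, rng):
    """Pull arms until the next pull no longer fits in the remaining budget.

    min_cost is an assumed, known positive lower bound on every arm's
    expected cost. Returns (total_reward, total_spent, pull_counts).
    """
    k = len(arms)
    pulls = [0] * k
    reward_sum = [0.0] * k
    cost_sum = [0.0] * k
    total_reward = 0.0
    remaining = budget
    t = 0

    def index(j):
        # Illustrative index (an assumption, not the paper's formula):
        # optimistic reward over pessimistic cost, floored at min_cost.
        bonus = math.sqrt(2.0 * math.log(t) / pulls[j])
        mean_reward = reward_sum[j] / pulls[j]
        mean_cost = cost_sum[j] / pulls[j]
        return (mean_reward + bonus) / max(mean_cost - bonus, min_cost)

    while True:
        t += 1
        if t <= k:
            i = t - 1  # initialization: try every arm once
        else:
            i = max(range(k), key=index)
        reward, cost = arms[i](rng)
        if cost > remaining:
            break  # this pull would exceed the budget, so it is discarded
        remaining -= cost
        total_reward += reward
        pulls[i] += 1
        reward_sum[i] += reward
        cost_sum[i] += cost

    return total_reward, budget - remaining, pulls


rng = random.Random(0)
arms = [make_arm(0.9, 0.1, 0.2), make_arm(0.1, 0.8, 1.0)]
total_reward, total_spent, pulls = budgeted_ucb(arms, 200.0, 0.1, rng)
print(total_reward, total_spent, pulls)
```

Stopping at the first pull that does not fit is one simple way to keep the total cost within the budget; the second algorithm described in the abstract would replace the fixed `min_cost` with an estimate computed from the observed costs.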