Questions tagged [multi-armed-bandit]

For questions related to the multi-armed bandit (MAB) problem, in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation.

13 questions
0
votes
0 answers
4 views

Is there a multi-agent version of EXP3?

The EXP3 algorithm as given in the figure below (taken from Adversarial Bandits and the Exp3 Algorithm) solves the adversarial bandit problem in the single-player case. What happens if there are ...
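For reference, a minimal single-player EXP3 sketch in Python, assuming rewards in $[0, 1]$ and the standard exponential-weights formulation (the `get_reward` callback and the value of `gamma` are illustrative assumptions, not part of the question):

```python
import math
import random

def exp3(n_arms, n_rounds, get_reward, gamma=0.1):
    """Single-player EXP3 for adversarial bandits with rewards in [0, 1]."""
    weights = [1.0] * n_arms
    for t in range(n_rounds):
        total = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        reward = get_reward(arm, t)  # only the pulled arm's reward is observed
        # Importance-weighted estimate keeps the update unbiased.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
    return weights
```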
0
votes
1 answer
60 views

Why am I getting better performance with Thompson sampling than with UCB or $\epsilon$-greedy in a multi-armed bandit problem? [closed]

I ran a test using three strategies for the multi-armed bandit problem: UCB, $\epsilon$-greedy, and Thompson sampling. The results for the rewards were as follows: Thompson sampling had the highest average ...
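Such results are common for Bernoulli-reward bandits. As a point of comparison, here is a minimal Beta-Bernoulli Thompson-sampling loop in Python (the arm probabilities and horizon are illustrative assumptions):

```python
import random

def thompson_sampling(true_probs, n_rounds):
    """Beta-Bernoulli Thompson sampling; returns the total reward collected."""
    n_arms = len(true_probs)
    successes = [1] * n_arms  # Beta(1, 1) uniform priors
    failures = [1] * n_arms
    total_reward = 0
    for _ in range(n_rounds):
        # Sample a plausible mean for each arm from its posterior; pull the best.
        samples = [random.betavariate(successes[a], failures[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if random.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

print(thompson_sampling([0.3, 0.5, 0.7], n_rounds=10_000))
```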
5
votes
1 answer
167 views

How do I recognise a bandit problem?

I'm having difficulty understanding the distinction between a bandit problem and a non-bandit problem. An example of the bandit problem is an agent playing $n$ slot machines with the goal of ...
1
vote
1 answer
25 views

When is a discounted MAB useful?

Many multi-armed bandit algorithms take the total reward to be the plain sum of all rewards. However, in RL, the discounted reward is mainly used. Why is the discounted reward not used in the MAB ...
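For context, the two objectives being contrasted are, in standard notation,
$$\text{undiscounted: } \sum_{t=1}^{T} r_t, \qquad \text{discounted: } \sum_{t=1}^{\infty} \gamma^{t-1} r_t, \quad 0 < \gamma < 1,$$
where the discounted form is the usual RL return and the undiscounted sum is the usual bandit objective.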
2
votes
2 answers
55 views

Are bandits considered an RL approach?

If a research paper uses multi-armed bandits (either in their standard or contextual form) to solve a particular task, can we say that they solved this task using a reinforcement learning approach? Or ...
1
vote
1 answer
26 views

How do we arrive at the formula for UCB action selection in the multi-armed bandit problem?

I came across the formula for Upper Confidence Bound Action Selection (while studying multi-armed bandit problem), which looks like: $$ A_t \dot{=} \operatorname{argmax}_a \left[ Q_t(a) + c \sqrt{ \...
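The excerpt is cut off, but this is presumably the standard UCB rule from Sutton and Barto (Eq. 2.10), which in full reads
$$A_t \doteq \operatorname{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right],$$
where $Q_t(a)$ is the current value estimate of arm $a$, $N_t(a)$ is the number of times $a$ has been selected so far, and $c > 0$ controls the degree of exploration.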
1
vote
0 answers
33 views

How do I determine the optimal policy in a bandit problem with missing contexts?

Suppose I learn an optimal policy $\pi(a|c)$ for a contextual multi-armed bandit problem, where the context $c$ is a composite of multiple context variables $c = c_1, c_2, c_3$. For example, the ...
2
votes
1 answer
91 views

Solving a Multi-Armed, “Multi-Bandit” Problem

This is the problem: I have 66 slot machines and for each of them I have 7 possible actions/arms to choose from. At each trial, I have to choose one of 7 actions for each and every one of the 66 slots....
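If the 66 machines are statistically independent, one natural reading is to run 66 separate bandit learners, one per machine. A minimal sketch, assuming independence and $\epsilon$-greedy learners (all constants and names here are illustrative):

```python
import random

N_MACHINES, N_ARMS, EPSILON = 66, 7, 0.1

# One value table and pull counter per machine: 66 independent bandits.
values = [[0.0] * N_ARMS for _ in range(N_MACHINES)]
counts = [[0] * N_ARMS for _ in range(N_MACHINES)]

def choose(machine):
    """Epsilon-greedy choice among the 7 arms of one machine."""
    if random.random() < EPSILON:
        return random.randrange(N_ARMS)
    return max(range(N_ARMS), key=lambda a: values[machine][a])

def update(machine, arm, reward):
    """Incremental sample-average update for the pulled arm."""
    counts[machine][arm] += 1
    values[machine][arm] += (reward - values[machine][arm]) / counts[machine][arm]
```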
2
votes
1 answer
382 views

How to implement a contextual reinforcement learning model?

In a reinforcement learning model, states depend on the previous actions chosen. In the case in which some of the states (but not all) are fully independent of the actions (but still obviously ...
1
vote
1 answer
228 views

Large action set in multi-armed bandits

Suppose one is using a multi-armed bandit, and one has relatively few "pulls" (i.e. timesteps) relative to the action set. For example, maybe there are 200 timesteps and 100 possible actions. ...
1
vote
0 answers
29 views

Name of a multi-armed bandit with only some levers available

In order to model a card game, as an exercise, I was thinking of an elementary setting as a multi-armed bandit, each lever being the distribution of expected rewards of a specific card. But, of course,...
2
votes
1 answer
59 views

Programming a bandit to optimize donations

I'm developing a multi-armed bandit which learns the best information to display to persuade someone to donate to charity. Suppose I have treatments A, B, C, D (which are each one paragraph of text). ...
5
votes
1 answer
163 views

What is a weighted average in a non-stationary k-armed bandit problem?

In the book Reinforcement Learning: An Introduction (page 25), by Richard S. Sutton and Andrew G. Barto, there is a discussion of the k-armed bandit problem, where the expected reward from the bandits ...
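For reference, the update in question is the constant step-size rule, which Sutton and Barto show unrolls into an exponential recency-weighted average of past rewards:
$$Q_{n+1} = Q_n + \alpha (R_n - Q_n) = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i,$$
so recent rewards carry exponentially more weight than old ones, which is what makes the rule suitable for non-stationary problems.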