# Is the optimal policy always stochastic if the environment is also stochastic?

Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?

Intuitively, if the environment is deterministic (that is, if the agent is in a state $$s$$ and takes action $$a$$, then the next state $$s'$$ is always the same, no matter the time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).

For example, consider the multi-armed bandit problem. You have $$n$$ arms, each of which gives you a reward (1 point, for example) with probability $$p_i$$, where $$i$$ is between 1 and $$n$$. This is a simple stochastic environment: it has only one state, but it is still an environment.

But obviously the optimal policy is to choose the arm with the highest $$p_i$$. So this is not a stochastic policy.
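To make the expectation argument concrete, here is a small Monte Carlo sketch (the arm probabilities are made up purely for illustration): the deterministic greedy policy of always pulling the best arm earns more per pull than a policy that mixes in other arms.

```python
import random

random.seed(0)  # for reproducibility

# Hypothetical 3-armed bandit: arm i pays 1 point with probability p[i]
# (these probabilities are invented for illustration).
p = [0.2, 0.5, 0.8]

def pull(arm):
    """Stochastic environment: reward is 1 with probability p[arm]."""
    return 1 if random.random() < p[arm] else 0

# The expected reward of always pulling arm i is simply p[i], so the
# optimal policy is deterministic: always pick argmax_i p[i].
best_arm = max(range(len(p)), key=lambda i: p[i])

# Monte Carlo check: the greedy deterministic policy earns about
# p[best_arm] per pull, more than a policy that mixes in other arms.
n = 100_000
greedy = sum(pull(best_arm) for _ in range(n)) / n
mixed = sum(pull(random.randrange(len(p))) for _ in range(n)) / n
print(best_arm, greedy, mixed)
```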

Obviously, if you are in an environment where you play against another agent (a game theory setting), your optimal policy will certainly be stochastic (think of a poker game, for example).

• Why would it be obvious to always choose the arm with the highest $p_i$? $p_i$ is a probability, so it's not certain you will always obtain the highest amount of reward (at least, in finite time) if you always choose arm $i$. nbro Feb 15 at 14:03
• @nbro: It is certain in expectation, which is what the optimal policy maximises. Policies don't try to second-guess random number generators; that is assumed impossible (if it were possible due to some internal state of the system, you must either add that internal state to the model, or treat it as a POMDP). Neil Slater Feb 15 at 14:07
• @NeilSlater Ok. But would the conclusion change if time is finite? If you have a limited amount of time to play, then the expectation, I guess, must also consider the available time to play. nbro Feb 15 at 14:11
• @nbro: That may change your decisions, but it is not really about the optimal policy. The optimal policy for the bandit arms is still deterministic: always use the best arm, but you don't know which it is. This is about exploration vs exploitation. You could phrase that as having "an optimal policy for exploring a bandit problem", perhaps. That is not the terminology used in e.g. Sutton & Barto, but perhaps some practitioners do say that, I don't know... Neil Slater Feb 15 at 14:15
• The environment contains only one state, in which you face the same decision over and over: which arm do I have to choose? Adrien Forbu Feb 16 at 19:24

Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?

No.

An optimal policy is generally deterministic unless:

• Important state information is missing (a POMDP). For example, in a map where the agent is not allowed to know its exact location or remember previous states, and the state it is given is not enough to disambiguate between locations. If the goal is to get to a specific end location, the optimal policy may include some random moves in order to avoid becoming stuck. Note that the environment in this case could be deterministic (from the perspective of someone who can see the whole state), but still lead to requiring a stochastic policy to solve it.

• There is some kind of minimax game theory scenario, where a deterministic policy can be punished by the environment or another agent. Think scissors/paper/stone, or prisoner's dilemma.
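The scissors/paper/stone case can be checked with a few lines of arithmetic. Against an opponent who plays the best response, every deterministic policy loses every round (value $-1$), while the uniform random policy cannot be exploited (value $0$). A minimal sketch, with the payoff matrix written out by hand:

```python
# Payoff matrix for the row player in scissors/paper/stone:
# +1 win, 0 draw, -1 loss. Rows/columns ordered: scissors, paper, stone.
PAYOFF = [
    [ 0,  1, -1],   # scissors beats paper, loses to stone
    [-1,  0,  1],   # paper loses to scissors, beats stone
    [ 1, -1,  0],   # stone beats scissors, loses to paper
]

def value_against_best_response(policy):
    """Row player's expected payoff when the opponent plays the best
    (pure) response to the row player's mixed policy."""
    expected = [sum(policy[a] * PAYOFF[a][b] for a in range(3))
                for b in range(3)]
    return min(expected)  # opponent picks the column that hurts us most

# Any deterministic policy is fully exploitable: value -1.
always_scissors = [1.0, 0.0, 0.0]
# The uniform stochastic policy cannot be exploited: value 0.
uniform = [1/3, 1/3, 1/3]
print(value_against_best_response(always_scissors))  # -1.0
print(value_against_best_response(uniform))
```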

Intuitively, if the environment is deterministic (that is, if the agent is in a state $$s$$ and takes action $$a$$, then the next state $$s'$$ is always the same, no matter the time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).

That seems reasonable, but you can take that intuition further with any method based on a value function:

If you have found an optimal value function, then acting greedily with respect to it *is* the optimal policy.

The above statement is just a natural language re-statement of the Bellman optimality equation:

$$v^*(s) = \max_a \sum_{r,s'} p(r,s'|s,a)(r+\gamma v^*(s'))$$

i.e. the optimal values are obtained when always choosing the action that maximises reward plus discounted value of the next step. The $$\max_a$$ operation is deterministic (if necessary, you can break ties for max value deterministically with e.g. an ordered list of actions).

Therefore, any environment that can be modelled by an MDP and solved by a value-based method (e.g. value iteration, Q-learning) has an optimal policy which is deterministic.
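As a sketch of this, here is value iteration on a tiny stochastic MDP (the two states, two actions, transition probabilities, and rewards are all invented for illustration). The greedy policy extracted from $$v^*$$ is a plain state-to-action map: deterministic, despite the stochastic transitions.

```python
# A tiny made-up stochastic MDP: p[s][a] is a list of
# (probability, reward, next_state) triples.
p = {
    0: {0: [(0.8, 1.0, 0), (0.2, 0.0, 1)],
        1: [(0.5, 2.0, 1), (0.5, 0.0, 0)]},
    1: {0: [(1.0, 0.0, 0)],
        1: [(0.9, 1.0, 1), (0.1, 0.0, 0)]},
}
gamma = 0.9

def q(s, a, v):
    """Expected reward plus discounted value: sum_{r,s'} p(r,s'|s,a)(r + gamma v(s'))."""
    return sum(pr * (r + gamma * v[s2]) for pr, r, s2 in p[s][a])

# Value iteration: repeatedly apply v(s) <- max_a q(s, a, v).
v = {s: 0.0 for s in p}
for _ in range(1000):
    v = {s: max(q(s, a, v) for a in p[s]) for s in p}

# Acting greedily w.r.t. v* gives the optimal policy -- a plain
# state -> action map, with no probability distribution in sight.
policy = {s: max(p[s], key=lambda a: q(s, a, v)) for s in p}
print(policy)
```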

It is possible in such an environment that the optimal solution may not be stochastic at all (i.e. if you add any randomness to the deterministic optimal policy, the policy becomes strictly worse). However, when there are ties for maximum value between two or more actions in one or more states, then there are multiple equivalent optimal deterministic policies. You may construct a stochastic policy that mixes these in any combination, and it will also be optimal.

• "It is possible in such an environment that no stochastic policy is optimal", you mean deterministic policy? nbro Feb 15 at 14:14
• @nbro: No, I really mean that there is no optimal stochastic policy. This is commonly the case. Think for example of a simple maze solver. If the optimal deterministic solution is a single path from start to exit, adding any randomness at all to it will make the policy strictly worse. This does not change if the environment adds random noise (e.g. moves sometimes failing). Neil Slater Feb 15 at 14:24
• I understand now. You're saying that there's always a deterministic policy, and that a stochastic policy derived from it will likely be worse than the optimal deterministic policy. nbro Feb 15 at 14:39
• @nbro: Yes, that's it. Neil Slater Feb 15 at 15:12

I'm thinking of a probability landscape in which you find yourself as an actor, with various unknown peaks and troughs. A good deterministic approach is always likely to lead you to the nearest local optimum, but not necessarily to the global optimum. To find the global optimum, something like an MCMC algorithm would allow you to stochastically accept a temporarily worse outcome in order to escape from a local optimum and find the global optimum. My intuition is that in a stochastic environment this would also be true.
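This intuition can be sketched with a toy Metropolis-style search on a made-up 1-D reward landscape (the landscape function, step size, and cooling schedule are all invented for illustration): greedy uphill acceptance gets stuck on a local peak, while stochastic acceptance of temporarily worse points can reach the global one.

```python
import math
import random

# Hypothetical 1-D reward landscape: a local peak of height ~1 near
# x = 2 and the global peak of height ~2 near x = 8.
def reward(x):
    return math.exp(-(x - 2) ** 2) + 2 * math.exp(-((x - 8) ** 2) / 4)

def anneal(steps=20_000, temp=1.0, cooling=0.9995, seed=0):
    """Metropolis rule: always accept improvements, and accept a worse
    point with probability exp(delta / temp), so the walker can escape
    the local peak while the temperature is still high."""
    rng = random.Random(seed)
    x = best = 0.0
    for _ in range(steps):
        x_new = x + rng.gauss(0, 0.5)
        delta = reward(x_new) - reward(x)
        if delta > 0 or rng.random() < math.exp(delta / temp):
            x = x_new                      # stochastic acceptance
        if reward(x) > reward(best):
            best = x
        temp *= cooling
    return best

def greedy(steps=20_000, seed=0):
    """Deterministic acceptance: only ever move uphill."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        x_new = x + rng.gauss(0, 0.5)
        if reward(x_new) > reward(x):
            x = x_new                      # gets stuck on the x = 2 peak
    return x

print(reward(greedy()), reward(anneal()))
```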

The form of a solution in a stochastic environment is always a hedge across all states.