Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?

No.

An optimal policy is generally deterministic unless:

Important state information is missing (a POMDP). For example, consider a map where the agent is not allowed to know its exact location or remember previous states, and the observation it is given is not enough to disambiguate between locations. If the goal is to get to a specific end location, the optimal policy may include some random moves in order to avoid becoming stuck (see the first sketch below). Note that the environment in this case could be deterministic (from the perspective of someone who can see the whole state), but still require a stochastic policy to solve it.

There is some kind of minimax game theory scenario, where a deterministic policy can be punished by the environment or another agent. Think scissors/paper/stone or the prisoner's dilemma (see the second sketch below).
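To make the aliasing point concrete, here is a minimal sketch (a made-up corridor, not taken from the question): the agent only observes whether it is at the left end, the right end, or "somewhere in the middle", so positions 1 and 3 are indistinguishable. A deterministic reactive policy oscillates forever from one of the starting positions, while flipping a coin in the aliased observation reaches the goal from both.

```python
import random

# Hypothetical aliased corridor: positions 0..4, goal at position 2.
# The agent only observes "L" at 0, "R" at 4, and "M" at 1 and 3 (aliased).
def observe(pos):
    return {0: "L", 4: "R"}.get(pos, "M")

def reaches_goal(policy, start, max_steps=200):
    pos = start
    for _ in range(max_steps):
        if pos == 2:                                       # goal reached
            return True
        pos = min(4, max(0, pos + policy(observe(pos))))   # -1 = left, +1 = right
    return False

det = {"L": +1, "M": +1, "R": -1}          # deterministic: always go right when aliased
deterministic = lambda obs: det[obs]
stochastic = lambda obs: random.choice([-1, +1]) if obs == "M" else det[obs]

for name, pi in [("deterministic", deterministic), ("stochastic", stochastic)]:
    wins = sum(reaches_goal(pi, start) for start in (1, 3))
    print(f"{name}: reaches the goal from {wins}/2 starting positions")
```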
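And a sketch of the game-theoretic case, with my own toy payoffs: against a fixed deterministic choice, an opponent's best response wins every round, while against the uniform random policy no response gains anything in expectation.

```python
# Rock/paper/scissors: payoff to the opponent for playing b against our fixed policy.
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

def best_response_value(policy):
    """Highest expected payoff an opponent can get against a fixed policy."""
    return max(sum(policy[a] * payoff(b, a) for a in ACTIONS) for b in ACTIONS)

always_rock = {"rock": 1.0, "paper": 0.0, "scissors": 0.0}
uniform = {a: 1 / 3 for a in ACTIONS}

print(best_response_value(always_rock))        # 1.0  -> fully exploited
print(round(best_response_value(uniform), 3))  # 0.0  -> unexploitable
```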

Intuitively, if the environment is deterministic (that is, if the agent is in state $s$ and takes action $a$, then the next state $s'$ is always the same, no matter which time step), then the optimal policy should also be deterministic (that is, it should be a map from states to actions, and not to a probability distribution over actions).

That seems reasonable, but you can take that intuition further with any method based on a value function:

*If you have found an optimal value function, then acting greedily with respect to it **is** the optimal policy.*

The above statement is just a natural language re-statement of the Bellman optimality equation:

$$v^*(s) = \text{max}_a \sum_{r,s'}p(r,s'|s,a)(r+\gamma v^*(s'))$$

i.e. the optimal values are obtained by always choosing the action that maximises reward plus the discounted value of the next state. The $\text{max}_a$ operation is deterministic (if necessary, you can break ties for the maximum value deterministically with e.g. an ordered list of actions).

Therefore, any environment that can be modelled by an MDP and solved by a value-based method (e.g. value iteration, Q-learning) has an optimal policy which is deterministic.
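As a concrete (made-up) illustration of that claim, here is a minimal value iteration sketch on a tiny MDP. The inner update is exactly the Bellman optimality equation above, and the extracted greedy policy is deterministic because taking the max over a fixed, ordered action list always breaks ties the same way.

```python
# Hypothetical 3-state MDP: P[s][a] is a list of (prob, reward, next_state).
P = {
    0: {"left": [(1.0, 0.0, 0)], "right": [(1.0, 1.0, 1)]},
    1: {"left": [(1.0, 0.0, 0)], "right": [(1.0, 5.0, 2)]},
    2: {},                                   # terminal
}
GAMMA = 0.9
ACTIONS = ["left", "right"]                  # fixed order used to break ties deterministically

def q(s, a, V):
    """Expected reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, r, s2 in P[s][a])

V = {s: 0.0 for s in P}
for _ in range(100):                         # value iteration sweeps
    for s in P:
        if P[s]:
            V[s] = max(q(s, a, V) for a in ACTIONS)   # Bellman optimality update

# Greedy policy: argmax over the ordered action list, so ties are always
# broken the same way -> a deterministic policy.
policy = {s: max(ACTIONS, key=lambda a: q(s, a, V)) for s in P if P[s]}

print(V)        # {0: 5.5, 1: 5.0, 2: 0.0}
print(policy)   # {0: 'right', 1: 'right'}
```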

It is possible in such an environment that there is no stochastic optimal policy at all (i.e. if you add any randomness to the deterministic optimal policy, the policy becomes strictly worse). However, when there are ties for maximum value between two or more actions in one or more states, then there are multiple equivalent optimal and deterministic policies. You may construct a stochastic policy that mixes these in any combination, and it will also be optimal.
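A final toy sketch of that last point, using made-up action values: two actions tied for the optimum can be mixed in any proportion without losing value, but mixing in a strictly worse action lowers the expected return.

```python
# Action values in one state (made-up numbers): "a" and "b" are tied, "c" is worse.
q = {"a": 1.0, "b": 1.0, "c": 0.0}

def value(pi):
    """Expected return of a stochastic policy pi over this single decision."""
    return sum(pi[a] * q[a] for a in q)

print(value({"a": 1.0, "b": 0.0, "c": 0.0}))  # 1.0  deterministic optimal policy
print(value({"a": 0.4, "b": 0.6, "c": 0.0}))  # 1.0  mixing the tied actions is still optimal
print(value({"a": 0.9, "b": 0.0, "c": 0.1}))  # 0.9  mixing in a suboptimal action is strictly worse
```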