# What is the Bellman operator in reinforcement learning?

In mathematics,the word "operator" can refer to several distinct but related concepts.An operator can be defined as a function between two vector spaces,it can be defined as function where the domain and the codomain are the same,or it can be defined as a function from functions (which are vectors) to other functions (for example,thedifferential operator),that is,an high-order function (if you are familiar with functional programming).

What is the Bellman operator in reinforcement learning?Why do we even need it?Where is it used?为什么它被称为一个“操作符”呢?If it is an operator,then it is a map from a function space to another function space.Which are these function spaces?How is the Bellman operator related to the Bellman equations in RL?

The notation I'll be using is fromtwo differentlectures by David Silver and is also informed bythese slides.

The expected Bellman equation is$$v_\pi(s) = \sum_{a\in \cal{A}} \pi(a|s) \left(\cal{R}_s^a + \gamma\sum_{s' \in \cal{S}} \cal{P}_{ss'}^a v_\pi(s')\right) \tag 1$$

If we let$$\cal{P}_{ss'}^\pi = \sum\limits_{a \in \cal{A}} \pi(a|s)\cal{P}_{ss'}^a \tag 2$$and$$\cal{R}_{s}^\pi = \sum\limits_{a \in \cal{A}} \pi(a|s)\cal{R}_{s}^a \tag 3$$then we can rewrite$$(1)$$as

$$v_\pi(s) = \cal{R}_s^\pi + \gamma\sum_{s' \in \cal{S}} \cal{P}_{ss'}^\pi v_\pi(s') \tag 4$$

This can be written in matrix form

$$\left.\begin{bmatrix}v_\pi(1) \\\vdots \\v_\pi(n)\end{bmatrix}=\begin{bmatrix}\cal{R}_1^\pi \\\vdots \\\cal{R}_n^\pi\end{bmatrix}+\gamma\begin{bmatrix}\cal{P}_{11}^\pi & \dots & \cal{P}_{1n}^\pi\\\vdots & \ddots & \vdots\\\cal{P}_{n1}^\pi & \dots & \cal{P}_{nn}^\pi\end{bmatrix}\begin{bmatrix}v_\pi(1) \\\vdots \\v_\pi(n)\end{bmatrix}\right.\tag 5$$

Or,more compactly,

$$v_\pi = \cal{R}^\pi + \gamma \cal{P}^\pi v_\pi \tag 6$$

Notice that both sides of$$(6)$$are$$n$$-dimensional vectors.Here$$n=|\cal{S}|$$is the size of the state space.We can then define an operator$$\cal{T}^\pi:\mathbb{R}^n\to\mathbb{R}^n$$as

$$\cal{T^\pi}(v) = \cal{R}^\pi + \gamma \cal{P}^\pi v \tag 7$$

for any$$v\in \mathbb{R}^n$$.This is the expected Bellman operator.

Similarly,you can rewrite the Bellman optimality equation

$$v_*(s) = \max_{a\in\cal{A}} \left(\cal{R}_s^a + \gamma\sum_{s' \in \cal{S}} \cal{P}_{ss'}^a v_*(s')\right) \tag 8$$

as the Bellman optimality operator

$$\cal{T^*}(v) = \max_{a\in\cal{A}} \left(\cal{R}^a + \gamma \cal{P}^a v\right) \tag 9$$

The Bellman operators are "operators" in that they are mappings from one point to another within the vector space of state values,$$\mathbb{R}^n$$.

Rewriting the Bellman equations as operators is useful for proving that certain dynamic programming algorithms (e.g.policy iteration,value iteration) converge to a unique fixed point.This usefulness comes in the form of a body of existing work in operator theory,which allows us to make use of special properties of the Bellman operators.

Specifically,the fact that the Bellman operators arecontractionsgives the useful results that,for any policy$$\pi$$and any initial vector$$v$$,

$$\lim_{k\to\infty}(\cal{T}^\pi)^k v = v_\pi \tag{10}$$

$$\lim_{k\to\infty}(\cal{T}^*)^k v = v_* \tag{11}$$

where$$v_\pi$$is the value of policy$$\pi$$and$$v_*$$is the value of an optimal policy$$\pi^*$$.The proof is due to thecontraction mapping theorem.