The notation I'll be using is fromtwo differentlectures by David Silver and is also informed bythese slides.

The expected Bellman equation is$$v_\pi(s) = \sum_{a\in \cal{A}} \pi(a|s) \left(\cal{R}_s^a + \gamma\sum_{s' \in \cal{S}} \cal{P}_{ss'}^a v_\pi(s')\right) \tag 1$$

If we let$$\cal{P}_{ss'}^\pi = \sum\limits_{a \in \cal{A}} \pi(a|s)\cal{P}_{ss'}^a \tag 2$$and$$\cal{R}_{s}^\pi = \sum\limits_{a \in \cal{A}} \pi(a|s)\cal{R}_{s}^a \tag 3$$then we can rewrite$(1)$as

$$v_\pi(s) = \cal{R}_s^\pi + \gamma\sum_{s' \in \cal{S}} \cal{P}_{ss'}^\pi v_\pi(s') \tag 4$$

This can be written in matrix form

$$\left.\begin{bmatrix}v_\pi(1) \\\vdots \\v_\pi(n)\end{bmatrix}=\begin{bmatrix}\cal{R}_1^\pi \\\vdots \\\cal{R}_n^\pi\end{bmatrix}+\gamma\begin{bmatrix}\cal{P}_{11}^\pi & \dots & \cal{P}_{1n}^\pi\\\vdots & \ddots & \vdots\\\cal{P}_{n1}^\pi & \dots & \cal{P}_{nn}^\pi\end{bmatrix}\begin{bmatrix}v_\pi(1) \\\vdots \\v_\pi(n)\end{bmatrix}\right.\tag 5$$

Or,more compactly,

$$v_\pi = \cal{R}^\pi + \gamma \cal{P}^\pi v_\pi \tag 6$$

Notice that both sides of$(6)$are$n$-dimensional vectors.Here$n=|\cal{S}|$is the size of the state space.We can then define an operator$\cal{T}^\pi:\mathbb{R}^n\to\mathbb{R}^n$as

$$\cal{T^\pi}(v) = \cal{R}^\pi + \gamma \cal{P}^\pi v \tag 7$$

for any$v\in \mathbb{R}^n$.This is the expected Bellman operator.

Similarly,you can rewrite the Bellman optimality equation

$$v_*(s) = \max_{a\in\cal{A}} \left(\cal{R}_s^a + \gamma\sum_{s' \in \cal{S}} \cal{P}_{ss'}^a v_*(s')\right) \tag 8$$

as the Bellman optimality operator

$$\cal{T^*}(v) = \max_{a\in\cal{A}} \left(\cal{R}^a + \gamma \cal{P}^a v\right) \tag 9$$

The Bellman operators are "operators" in that they are mappings from one point to another within the vector space of state values,$\mathbb{R}^n$.

Rewriting the Bellman equations as operators is useful for proving that certain dynamic programming algorithms (e.g.policy iteration,value iteration) converge to a unique fixed point.This usefulness comes in the form of a body of existing work in operator theory,which allows us to make use of special properties of the Bellman operators.

Specifically,the fact that the Bellman operators arecontractionsgives the useful results that,for any policy$\pi$and any initial vector$v$,

$$\lim_{k\to\infty}(\cal{T}^\pi)^k v = v_\pi \tag{10}$$

$$\lim_{k\to\infty}(\cal{T}^*)^k v = v_* \tag{11}$$

where$v_\pi$is the value of policy$\pi$and$v_*$is the value of an optimal policy$\pi^*$.The proof is due to thecontraction mapping theorem.

Feature-Based Methods for Large Scale Dynamic Programming(by John N.Tsitsiklis and Benjamin Van Roy,1996),An Analysis of Temporal-Difference Learning with Function Approximation(by John N.Tsitsiklis and Benjamin Van Roy,1997) andLeast-Squares Policy Iteration(by Michail G.Lagoudakis and Ronald Parr,2003). $\endgroup$– nbro Mar 10 at 22:59