
To clarify it in my head: the value function calculates how 'good' it is to be in a certain state by summing all future (discounted) rewards, while the reward function is what the value function uses to 'generate' those rewards for the calculation of how 'good' it is to be in that state?


1 Answer

    1
    $ \ $请将BeginGroup

I think it is pedagogically useful to distinguish here between theory (formulas) and practice (algorithms).

If you are talking about the definition of the value function (theory)

\begin{align}
v_{\pi}(s) &\dot{=} \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right] \\
&= \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big\vert\, S_t = s\right]
\end{align}

for all $s \in \mathcal{S}$, where $\dot{=}$ means "is defined as" and $\mathcal{S}$ is the state space, then the value function is defined in terms of the rewards, as you can clearly see above. (Note that $R_{t+k+1}$, $G_t$ and $S_t$ are random variables, and that the expectation is, in fact, taken with respect to those random variables.)
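As a minimal sketch of what this definition means computationally, you could approximate $v_\pi(s)$ by averaging sampled discounted returns $G_t$ collected while following $\pi$ from state $s$ (the names below, such as `sampled_reward_sequences`, are just illustrative):

```python
# Minimal sketch: approximate v_pi(s) by averaging sampled discounted returns.
# `sampled_reward_sequences` stands for the rewards R_{t+1}, R_{t+2}, ...
# observed along trajectories that start in state s and follow pi.

def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for one sampled trajectory."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def monte_carlo_value_estimate(sampled_reward_sequences, gamma=0.9):
    """Average sampled returns to approximate v_pi(s) = E_pi[G_t | S_t = s]."""
    returns = [discounted_return(rs, gamma) for rs in sampled_reward_sequences]
    return sum(returns) / len(returns)

# Example: two sampled trajectories starting from the same state s.
print(monte_carlo_value_estimate([[1, 0, 2], [0, 1, 1]], gamma=0.9))
```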

The definition above can actually be expanded into a Bellman equation (i.e. a recursive equation) defined in terms of the reward function $R(s, a)$ of the underlying MDP. However, usually, rather than the notation $R(s, a)$, you will see $p(s', r \mid s, a)$ (which represents the combination of the transition probability function and the reward function).
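For reference, written out in that $p(s', r \mid s, a)$ notation, this Bellman equation is

$$v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[r + \gamma v_{\pi}(s')\bigr], \quad \text{for all } s \in \mathcal{S},$$

which makes the recursion explicit: the value of $s$ is the expected immediate reward plus the discounted value of the successor state.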

If you are estimating the value function (practice), e.g. with Q-learning, you do not necessarily use the reward function of the Markov decision process. You can estimate the value function just by observing the rewards you receive while exploring the environment, without ever really knowing the reward function. However, by exploring the environment, you can actually estimate the reward function. For example, if every time you are in state $s$ and take action $a$ you receive reward $r$, then you have already learned something about the actual underlying reward function. If you explore the MDP enough, you may learn the reward function too (unless it keeps changing, in which case it may be harder to learn).
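To make this concrete, here is a minimal tabular Q-learning sketch (the `env.reset()`/`env.step()` interface and all names are hypothetical, chosen just for illustration): the update only uses the observed reward sample `reward`, never the reward function itself, while the very same experience can be averaged to get an empirical estimate of $R(s, a)$.

```python
import random
from collections import defaultdict

# Minimal sketch, not a full implementation. `env` is assumed to expose
# reset() -> state and step(action) -> (next_state, reward, done).

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    q = defaultdict(float)              # Q(s, a) estimates
    reward_sum = defaultdict(float)     # running sums to estimate R(s, a)
    reward_count = defaultdict(int)

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])

            next_state, reward, done = env.step(action)

            # Q-learning update: uses only the observed reward sample.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

            # Side effect of exploration: average observed rewards per (s, a)
            # to get an empirical estimate of the reward function.
            reward_sum[(state, action)] += reward
            reward_count[(state, action)] += 1

            state = next_state

    estimated_reward = {sa: reward_sum[sa] / reward_count[sa] for sa in reward_count}
    return q, estimated_reward
```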

To conclude, yes, value functions are certainly very related to reward functions and rewards, in ways that you can immediately see from the equations that define the value functions.

