Policy Gradient Algorithms in Reinforcement Learning

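  The earlier article on value function approximation in reinforcement learning described how to approximate the state value with a parametric function. Can the policy be parametrized as well? A policy can be viewed as a mapping from states to actions, $a \leftarrow \pi(s)$.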

Parametric Policy

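  We can parametrize a policy as $\pi_{\theta}(a|s)$. It can be a deterministic policy, written $a = \pi_{\theta}(s)$, or a stochastic one in probabilistic form, $\pi_{\theta}(a|s) = P(a|s;\theta)$. In the stochastic case the parametrized policy outputs a probability distribution over actions. Here $\theta$ denotes the policy parameters. Approximating the policy with parameters gives the model stronger generalization (generalize from seen states to unseen states).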
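  To make the two forms concrete, here is a minimal sketch. It assumes a tabular parameter matrix `theta` of shape (states, actions) and a softmax for the stochastic case; both choices, and the function names, are illustrative rather than something this article prescribes at this point.

```python
import numpy as np

def stochastic_policy(theta, s):
    """Return P(a | s; theta) as a vector over actions (softmax over row s)."""
    logits = theta[s] - theta[s].max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def deterministic_policy(theta, s):
    """Return a = pi_theta(s): here simply the highest-scoring action."""
    return int(np.argmax(theta[s]))

# Hypothetical parameters for 2 states and 3 actions
theta = np.array([[0.0, 1.0, 2.0],
                  [1.0, 1.0, 1.0]])

probs = stochastic_policy(theta, 0)        # a distribution over the 3 actions
rng = np.random.default_rng(0)
action = rng.choice(len(probs), p=probs)   # sample a ~ pi_theta(. | s=0)
greedy = deterministic_policy(theta, 0)    # always the same action for s=0
```

  The stochastic policy returns a full distribution that is then sampled, while the deterministic policy returns a single action for each state.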

Policy-based RL

  Compared with value-based RL, policy-based RL has some attractive properties, for example:

  • Advantages

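    • Better convergence properties (each policy update is only a small step, but it always moves in an improving direction, whereas value-based methods may keep oscillating slightly around the optimal value function without converging).
    • Effective in continuous action spaces (value-function methods take a max: at the next state $s_{t+1}$ they must find the action that maximizes the value function, which is costly when actions are continuous, so policy-based methods are more efficient there).
    • Can learn stochastic policies (value-based methods all act through a max or a greedy policy, while policy-based methods can work with a distribution over actions).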
  • Disadvantages

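    • Typically converge to a local rather than global optimum. (With a linear model you can reach the global optimum, but not with something like a neural network. Still, a local optimum of a complex model is often better than the global optimum of a linear model.)
    • Evaluating a policy is typically inefficient and of high variance. (The algorithm relies on sampling and on an estimate of the next value function, so the variance tends to be high.)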

Stochastic Policy

  • For stochastic policy $\pi_{\theta}(a|s) = P(a|s;\theta)$
  • Intuition

    • Lower the probability of actions that lead to low value/reward
    • Raise the probability of actions that lead to high value/reward

  In other words: if an action earns more reward it is reinforced, otherwise it is weakened. This is also the core idea of behaviorism.

Policy Gradient in One-Step MDPs

  Consider a simple environment, a one-step MDP: the starting state is sampled as $s \sim d(s)$, and the episode terminates after one time step with reward $r_{sa}$.

  The policy's expected value can then be written as:

$$ J(\theta) = \mathbb{E}_{\pi_{\theta}}[r] = \sum_{s \in S}d(s)\sum_{a \in A}\pi_{\theta}(a|s)r_{sa} $$

  Taking the partial derivative with respect to the parameters $\theta$ gives:

$$ \frac{\partial J(\theta)} {\partial \theta} = \sum_{s \in S} d(s) \sum_{a \in A} \frac{\partial \pi_{\theta}(a|s)}{\partial \theta} r_{sa} $$

Likelihood Ratio

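  But $\pi_{\theta}$ is a distribution, so how do we differentiate it?

  The answer is a mathematical trick called the likelihood ratio:

  • Likelihood ratios exploit the following identity

  We first rewrite the derivative with an exact identity (multiply and divide by $\pi_{\theta}(a|s)$):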

$$ \begin{aligned} \frac{\partial \pi_{\theta}(a | s)}{\partial \theta} &=\pi_{\theta}(a | s) \frac{1}{\pi_{\theta}(a | s)} \frac{\partial \pi_{\theta}(a | s)}{\partial \theta} \\ &=\pi_{\theta}(a | s) \frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta} \end{aligned} $$

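  Thus the gradient of the policy's expected value becomes

$$ \begin{aligned} \frac{\partial J(\theta)}{\partial \theta} &=\sum_{s \in S} d(s) \sum_{a \in A} \pi_{\theta}(a | s) \frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta} r_{sa} \\ &=\mathbb{E}_{\pi_{\theta}}\left[\frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta} r_{sa}\right] \end{aligned} $$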

  Everything above still concerns the one-step MDP.
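  The one-step result can be checked numerically. The sketch below assumes a tabular softmax policy and made-up values for $d(s)$ and $r_{sa}$ (none of these numbers come from the article); it compares the likelihood-ratio form of the gradient with a central finite-difference estimate of $\partial J / \partial \theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 4
d = np.array([0.5, 0.3, 0.2])                    # assumed state distribution d(s)
r = rng.normal(size=(n_states, n_actions))       # assumed rewards r_sa
theta = rng.normal(size=(n_states, n_actions))   # tabular policy parameters

def policy(theta):
    """pi_theta(a|s) as a (states, actions) matrix, softmax over each row."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def J(theta):
    """J(theta) = sum_s d(s) sum_a pi(a|s) r_sa."""
    return float(np.sum(d[:, None] * policy(theta) * r))

def grad_log_pi(theta, s, a):
    """d log pi(a|s) / d theta for a tabular softmax: one-hot(a) - pi(.|s) in row s."""
    g = np.zeros_like(theta)
    g[s] = -policy(theta)[s]
    g[s, a] += 1.0
    return g

# Likelihood-ratio form: sum_s d(s) sum_a pi(a|s) * grad log pi(a|s) * r_sa
pi = policy(theta)
grad_lr = sum(d[s] * pi[s, a] * r[s, a] * grad_log_pi(theta, s, a)
              for s in range(n_states) for a in range(n_actions))

# Central finite differences on J, one parameter at a time
eps = 1e-6
grad_fd = np.zeros_like(theta)
for idx in np.ndindex(theta.shape):
    e = np.zeros_like(theta)
    e[idx] = eps
    grad_fd[idx] = (J(theta + e) - J(theta - e)) / (2 * eps)

max_gap = float(np.max(np.abs(grad_lr - grad_fd)))
print("largest difference between the two gradients:", max_gap)
```

  The two gradients should agree up to finite-difference error, which is what the identity $\partial \pi = \pi \, \partial \log \pi$ predicts.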

Policy Gradient Theorem

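  Reinforcement learning problems, however, usually involve multi-step MDPs. The result above still holds if the immediate reward $r_{sa}$ is replaced by the value function.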

  • The policy gradient theorem generalizes the likelihood ratio approach to multi-step MDPs

    • Replaces instantaneous reward $r_{sa}$ with long-term value $Q^{\pi_{\theta}}(s,a)$.
  • Policy gradient theorem applies to

    • start state objective $J_{1}$, average reward objective $J_{avR}$, and average value objective $J_{avV}$.
  • Theorem

    • For any differentiable policy $\pi_{\theta}(a|s)$ and for any of the policy objective functions $J=J_{1}, J_{avR}, J_{avV}$, the policy gradient is:

$$ \frac{\partial J(\theta)}{\partial \theta} = \mathbb{E}_{\pi_{\theta}}\left[\frac{\partial \log \pi_{\theta}(a|s)}{\partial \theta}Q^{\pi_{\theta}}(s,a)\right] $$

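  A natural question: the policy $\pi$ contains the parameters $\theta$, and $Q^{\pi_{\theta}}(s,a)$ also depends on $\theta$, so why is $Q^{\pi_{\theta}}(s,a)$ not differentiated as well? The theorem states exactly this: for any of the policy objective functions $J=J_{1}, J_{avR}, J_{avV}$ the equation above holds, with no term for the derivative of $Q^{\pi_{\theta}}(s,a)$.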

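  • Proof

  We first define the baseline $J(\pi)$: follow the current policy $\pi$, interact with the environment, and take the average of the rewards obtained. This is the average reward objective $J_{avR}(\pi)$: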

$$ J(\pi)=\lim _{n \rightarrow \infty} \frac{1}{n} \mathbb{E}\left[r_{1}+r_{2}+\cdots+r_{n} | \pi\right]=\sum_{s} d^{\pi}(s) \sum_{a} \pi(a | s) r(s, a) $$

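  That is, the expected average of the rewards received at each time step.

  $d^{\pi}(s)$ is the probability that state $s$ is sampled under the current policy (the stationary state distribution).

  The state-action value can be written as: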

$$ Q^{\pi}(s, a)=\sum_{t=1}^{\infty} \mathbb{E}\left[r_{t}-J(\pi) | s_{0}=s, a_{0}=a, \pi\right] $$

  From this we can derive:

$$ \begin{aligned} \frac{\partial V^{\pi}(s)}{\partial \theta} & \stackrel{\text { def }}{=} \frac{\partial}{\partial \theta} \sum_{a} \pi(a | s) Q^{\pi}(s, a), \quad \forall s \\ &=\sum_{a}\left[\frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a)+\pi(a | s) \frac{\partial}{\partial \theta} Q^{\pi}(s, a)\right] \\ &=\sum_{a}\left[\frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a)+\pi(a | s) \frac{\partial}{\partial \theta}\left(r(s, a)-J(\pi)+\sum_{s^{\prime}} P_{s s^{\prime}}^{a} V^{\pi}\left(s^{\prime}\right)\right)\right] \\ &=\sum_{a}\left[\frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a)+\pi(a | s)\left(-\frac{\partial J(\pi)}{\partial \theta}+\frac{\partial}{\partial \theta} \sum_{s^{\prime}} P_{s s^{\prime}}^{a} V^{\pi}\left(s^{\prime}\right)\right)\right] \\ \Rightarrow \frac{\partial J(\pi)}{\partial \theta} &=\sum_{a}\left[\frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a)+\pi(a | s) \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta}\right]-\frac{\partial V^{\pi}(s)}{\partial \theta} \end{aligned} $$

  That is, we obtain:

$$ \frac{\partial J(\pi)}{\partial \theta} =\sum_{a}\left[\frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a)+\pi(a | s) \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta}\right]-\frac{\partial V^{\pi}(s)}{\partial \theta} $$

  Next we apply a simple transformation.

  Since $\frac{\partial J(\pi)}{\partial \theta}$ does not depend on the state $s$ and $\sum_{s} d^{\pi}(s) = 1$, we have $\frac{\partial J(\pi)}{\partial \theta} = \sum_{s}d^{\pi}(s)\frac{\partial J(\pi)}{\partial \theta}$. Weighting both sides of the equation above by $d^{\pi}(s)$ and summing over $s$ gives:

$$ \sum_{s} d^{\pi}(s) \frac{\partial J(\pi)}{\partial \theta}=\sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a)+\sum_{s} d^{\pi}(s) \sum_{a} \pi(a | s) \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta}-\sum_{s} d^{\pi}(s) \frac{\partial V^{\pi}(s)}{\partial \theta} $$

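  The last two terms, $\sum_{s} d^{\pi}(s) \sum_{a} \pi(a | s) \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta}-\sum_{s} d^{\pi}(s) \frac{\partial V^{\pi}(s)}{\partial \theta}$, in fact sum to zero. The proof relies on the stationarity of $d^{\pi}$, i.e. $\sum_{s} d^{\pi}(s) P_{s s^{\prime}} = d^{\pi}(s^{\prime})$: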

$$ \begin{aligned} &\sum_{s} d^{\pi}(s) \sum_{a} \pi(a | s) \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta}=\sum_{s} \sum_{a} \sum_{s^{\prime}} d^{\pi}(s) \pi(a | s) P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta}\\ &=\sum_{s} \sum_{s^{\prime}} d^{\pi}(s)\left(\sum_{a} \pi(a | s) P_{s s^{\prime}}^{a}\right) \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta}=\sum_{s} \sum_{s^{\prime}} d^{\pi}(s) P_{s s^{\prime}} \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta}\\ &=\sum_{s^{\prime}}\left(\sum_{s} d^{\pi}(s) P_{s s^{\prime}}\right) \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta}=\sum_{s^{\prime}} d^{\pi}\left(s^{\prime}\right) \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta} \end{aligned} $$
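  The key step in this chain is that $d^{\pi}$ is stationary under the policy-induced transition matrix $P_{s s^{\prime}} = \sum_{a} \pi(a|s) P_{s s^{\prime}}^{a}$. A small numerical sketch, assuming a randomly generated MDP and policy (all values hypothetical) and an arbitrary vector `g` standing in for one component of $\partial V^{\pi}(s^{\prime}) / \partial \theta$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 2

# P[a, s, t] = P^a_{s t}: random transition probabilities, rows normalized
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

# pi[s, a] = pi(a|s): a random stochastic policy
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

# Policy-induced chain: P_pi[s, t] = sum_a pi(a|s) P^a_{s t}
P_pi = np.einsum("sa,ast->st", pi, P)

# Stationary distribution: left eigenvector of P_pi for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P_pi.T)
k = np.argmin(np.abs(eigvals - 1.0))
d = np.real(eigvecs[:, k])
d /= d.sum()

# g stands in for dV(s')/dtheta; any vector works for this identity
g = rng.normal(size=n_states)
lhs = np.einsum("s,sa,ast,t->", d, pi, P, g)   # sum_s d(s) sum_a pi(a|s) sum_s' P^a_ss' g(s')
rhs = float(d @ g)                             # sum_s' d(s') g(s')
print(lhs, rhs)
```

  Because the identity holds for any `g`, the two $\partial V^{\pi} / \partial \theta$ terms cancel regardless of what the value gradient actually is.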

  Therefore:

$$ \begin{aligned} &\Rightarrow \sum_{s} d^{\pi}(s) \frac{\partial J(\pi)}{\partial \theta}=\sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a)+\sum_{s^{\prime}} d^{\pi}\left(s^{\prime}\right) \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta}-\sum_{s} d^{\pi}(s) \frac{\partial V^{\pi}(s)}{\partial \theta}\\ &\Rightarrow \frac{\partial J(\pi)}{\partial \theta}=\sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(a | s)}{\partial \theta} Q^{\pi}(s, a) \end{aligned} $$
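  To connect this result with the expectation form in the theorem statement, the likelihood ratio identity $\frac{\partial \pi}{\partial \theta} = \pi \frac{\partial \log \pi}{\partial \theta}$ can be applied once more. A short worked step in the same notation, where the subscript on the expectation (state drawn from $d^{\pi}$, action drawn from $\pi$) is written out explicitly as an assumption about what $\mathbb{E}_{\pi_{\theta}}$ means:

$$ \begin{aligned} \frac{\partial J(\pi)}{\partial \theta} &=\sum_{s} d^{\pi}(s) \sum_{a} \pi(a | s) \frac{\partial \log \pi(a | s)}{\partial \theta} Q^{\pi}(s, a) \\ &=\mathbb{E}_{s \sim d^{\pi},\, a \sim \pi(\cdot | s)}\left[\frac{\partial \log \pi(a | s)}{\partial \theta} Q^{\pi}(s, a)\right] \end{aligned} $$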

Monte-Carlo Policy Gradient (REINFORCE)

  How do we compute this $Q$ value function? The simplest way is Monte Carlo sampling:

  Using return $G_{t}$ as an unbiased sample of $Q^{\pi_{\theta}}(s,a)$:

$$ \Delta \theta_{t} = \alpha \frac{\partial \text{log} \pi_{\theta}(a_{t}|s_{t})}{\partial \theta} G_{t} $$

  If a separate model is used to approximate $Q^{\pi_{\theta}}(s,a)$ instead, the method is called Actor-Critic.
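  The REINFORCE update can be sketched in a few lines. The code below is a minimal illustration, not a reference implementation: the tabular softmax policy, the discount parameter `gamma`, the step size, and the small corridor environment in `run_episode` are all assumptions made up for this example.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def returns(rewards, gamma=1.0):
    """G_t = r_t + gamma * r_{t+1} + ... for every step of one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

def reinforce_update(theta, episode, alpha=0.1, gamma=1.0):
    """Apply theta += alpha * grad log pi(a_t|s_t) * G_t along one episode.
    episode: list of (state, action, reward); theta: (states, actions) logits."""
    states, actions, rewards = zip(*episode)
    theta = theta.copy()
    for s, a, G in zip(states, actions, returns(rewards, gamma)):
        grad = -softmax(theta[s])   # tabular softmax: one-hot(a) - pi(.|s)
        grad[a] += 1.0
        theta[s] += alpha * grad * G
    return theta

def run_episode(theta, rng, n_states=4, max_steps=50):
    """Hypothetical corridor: start in state 0; action 1 moves right, action 0
    moves left. Reaching the last state pays +1 and ends the episode; every
    other step pays -0.1."""
    s, episode = 0, []
    for _ in range(max_steps):
        a = int(rng.choice(2, p=softmax(theta[s])))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        done = s_next == n_states - 1
        episode.append((s, a, 1.0 if done else -0.1))
        s = s_next
        if done:
            break
    return episode

rng = np.random.default_rng(0)
theta = np.zeros((4, 2))
for _ in range(500):
    theta = reinforce_update(theta, run_episode(theta, rng), alpha=0.05)
p_right = softmax(theta[0])[1]
print("P(right | s=0) after training:", p_right)
```

  Each sampled return $G_{t}$ scales the log-likelihood gradient of the action actually taken, so actions followed by high returns become more likely. Replacing `returns` with a learned estimate of $Q^{\pi_{\theta}}(s,a)$ would turn this sketch into an actor-critic method.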

Softmax Stochastic Policy

  So how do we compute $\frac{\partial \log\pi_{\theta}(a|s)}{\partial \theta}$?

  As an example, we can build the policy with a softmax:

  • Softmax policy is a very commonly used stochastic policy

$$ \pi_{\theta}(a | s)=\frac{e^{f_{\theta}(s, a)}}{\sum_{a^{\prime}} e^{f_{\theta}\left(s, a^{\prime}\right)}} $$

  Here $f_{\theta}(s,a)$ is the score function of a state-action pair, parametrized by $\theta$, which can be defined with domain knowledge.

  • The gradient of its log-likelihood:

$$ \begin{aligned} \frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta} &=\frac{\partial f_{\theta}(s, a)}{\partial \theta}-\frac{1}{\sum_{a^{\prime}} e^{f_{\theta}\left(s, a^{\prime}\right)}} \sum_{a^{\prime \prime}} e^{f_{\theta}\left(s, a^{\prime \prime}\right)} \frac{\partial f_{\theta}\left(s, a^{\prime \prime}\right)}{\partial \theta} \\ &=\frac{\partial f_{\theta}(s, a)}{\partial \theta}-\mathbb{E}_{a^{\prime} \sim \pi_{\theta}\left(a^{\prime} | s\right)}\left[\frac{\partial f_{\theta}\left(s, a^{\prime}\right)}{\partial \theta}\right] \end{aligned} $$

  • For example, we define the linear score function

$$ \begin{aligned} f_{\theta}(s, a) &=\theta^{\top} x(s, a) \end{aligned} $$

$$ \begin{aligned} \frac{\partial \log \pi_{\theta}(a | s)}{\partial \theta} &=\frac{\partial f_{\theta}(s, a)}{\partial \theta}-\mathbb{E}_{a^{\prime} \sim \pi_{\theta}\left(a^{\prime} | s\right)}\left[\frac{\partial f_{\theta}\left(s, a^{\prime}\right)}{\partial \theta}\right] \\ &=x(s, a)-\mathbb{E}_{a^{\prime} \sim \pi_{\theta}\left(a^{\prime} | s\right)}\left[x\left(s, a^{\prime}\right)\right] \end{aligned} $$
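  The closed form for the linear score function is easy to verify numerically. The sketch below assumes a single fixed state with randomly generated feature vectors $x(s,a)$ (hypothetical values) and compares the analytic gradient $x(s,a) - \mathbb{E}_{a^{\prime}}[x(s,a^{\prime})]$ with central finite differences of $\log \pi_{\theta}(a|s)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 5, 3
X = rng.normal(size=(n_actions, n_features))  # row a is x(s, a) for one fixed state s
theta = rng.normal(size=n_features)

def log_pi(theta, a):
    """log pi_theta(a|s) for the softmax over linear scores f = theta^T x(s, a)."""
    scores = X @ theta
    m = scores.max()
    return scores[a] - (m + np.log(np.sum(np.exp(scores - m))))

def grad_log_pi(theta, a):
    """Analytic gradient: x(s, a) - E_{a' ~ pi}[x(s, a')]."""
    scores = X @ theta
    pi = np.exp(scores - scores.max())
    pi /= pi.sum()
    return X[a] - pi @ X

a = 2
analytic = grad_log_pi(theta, a)

# Central finite differences along each coordinate of theta
eps = 1e-6
numeric = np.array([
    (log_pi(theta + eps * e, a) - log_pi(theta - eps * e, a)) / (2 * eps)
    for e in np.eye(n_features)
])

max_gap = float(np.max(np.abs(analytic - numeric)))
print("largest difference:", max_gap)
```

  A useful side property of this gradient is that its expectation under $\pi_{\theta}$ is zero, since $\mathbb{E}_{a}[x(s,a)] - \mathbb{E}_{a^{\prime}}[x(s,a^{\prime})] = 0$.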

APPENDIX

Policy gradient theorem: Start Value Setting

  • Start state value objective

$$ \begin{aligned} J(\pi) &=\mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_{t} | s_{0}, \pi\right] \\ Q^{\pi}(s, a) &=\mathbb{E}\left[\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} | s_{t}=s, a_{t}=a, \pi\right] \\ \end{aligned} $$

$$ \begin{aligned} \frac{\partial V^{\pi}(s)}{\partial \theta} & \stackrel{\text { def }}{=} \frac{\partial}{\partial \theta} \sum_{a} \pi(s, a) Q^{\pi}(s, a), \quad \forall s \\ &=\sum_{a}\left[\frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a)+\pi(s, a) \frac{\partial}{\partial \theta} Q^{\pi}(s, a)\right] \\ &=\sum_{a}\left[\frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a)+\pi(s, a) \frac{\partial}{\partial \theta}\left(r(s, a)+\sum_{s^{\prime}} \gamma P_{s s^{\prime}}^{a} V^{\pi}\left(s^{\prime}\right)\right)\right] \\ &=\sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a)+\sum_{a} \pi(s, a) \gamma \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \frac{\partial V^{\pi}\left(s^{\prime}\right)}{\partial \theta} \end{aligned} $$

$$ \begin{aligned} \frac{\partial V^{\pi}(s)}{\partial \theta}=\sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a)+\sum_{a} \pi(s, a) \gamma \sum_{s_{1}} P_{s s_{1}}^{a} \frac{\partial V^{\pi}\left(s_{1}\right)}{\partial \theta} \end{aligned} $$

$$ \begin{aligned} \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a)=& \gamma^{0} \operatorname{Pr}(s \rightarrow s, 0, \pi) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a) \\ \sum_{a} \pi(s, a) \gamma \sum_{s_{1}} P_{s s_{1}}^{a} \frac{\partial V^{\pi}\left(s_{1}\right)}{\partial \theta} &=\sum_{s_{1}} \sum_{a} \pi(s, a) \gamma P_{s s_{1}}^{a} \frac{\partial V^{\pi}\left(s_{1}\right)}{\partial \theta} \\ &=\sum_{s_{1}} \gamma P_{s s_{1}} \frac{\partial V^{\pi}\left(s_{1}\right)}{\partial \theta}=\gamma^{1} \sum_{s_{1}} \operatorname{Pr}\left(s \rightarrow s_{1}, 1, \pi\right) \frac{\partial V^{\pi}\left(s_{1}\right)}{\partial \theta} \end{aligned} $$

$$ \frac{\partial V^{\pi}\left(s_{1}\right)}{\partial \theta}=\sum_{a} \frac{\partial \pi(s_{1}, a)}{\partial \theta} Q^{\pi}(s_{1}, a)+\gamma^{1} \sum_{s_{2}} \operatorname{Pr}\left(s_{1} \rightarrow s_{2}, 1, \pi\right) \frac{\partial V^{\pi}\left(s_{2}\right)}{\partial \theta} $$
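  The recursion can be unrolled one step at a time by substituting the expression for each $\frac{\partial V^{\pi}(s_{k})}{\partial \theta}$ into the previous one. A sketch of where this leads, assuming $\gamma < 1$ so that the sums converge and writing $\operatorname{Pr}(s \rightarrow x, k, \pi)$ for the probability of reaching $x$ from $s$ in $k$ steps under $\pi$. Note that $d^{\pi}$ here is defined as a discounted visitation weighting, which is an assumption of this sketch and differs from the stationary distribution used in the average reward proof:

$$ \begin{aligned} \frac{\partial V^{\pi}(s)}{\partial \theta} &=\sum_{x} \sum_{k=0}^{\infty} \gamma^{k} \operatorname{Pr}(s \rightarrow x, k, \pi) \sum_{a} \frac{\partial \pi(x, a)}{\partial \theta} Q^{\pi}(x, a) \\ \frac{\partial J(\pi)}{\partial \theta} &=\frac{\partial V^{\pi}\left(s_{0}\right)}{\partial \theta}=\sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^{\pi}(s, a), \quad d^{\pi}(s)=\sum_{k=0}^{\infty} \gamma^{k} \operatorname{Pr}\left(s_{0} \rightarrow s, k, \pi\right) \end{aligned} $$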

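Author's WeChat official account: 深度学习与先进智能决策 (Deep Learning and Advanced Intelligent Decision-Making)
WeChat official account ID: MultiAgent1024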