- Paper title: Deterministic Policy Gradient Algorithms
What problem does it solve?
Stochastic-policy methods inject randomness into action selection, so they are sample-inefficient and their gradient estimates have high variance. A deterministic policy is more sample-efficient than a stochastic one, but by itself it cannot explore the environment, so it has to be trained off-policy, with a stochastic behaviour policy supplying the exploration.
Background
Previously the policy output an action distribution $\pi_{\theta}(a|s)$; the authors instead propose outputting a deterministic policy, $a =\mu_{\theta}(s)$.
In the stochastic case, the policy gradient integrates over both state and action spaces, whereas in the deterministic case it only integrates over the state space.
- Stochastic Policy Gradient
Earlier work used off-policy stochastic-policy methods, with a behaviour policy $\beta(a|s) \neq \pi_{\theta}(a|s)$:
$$ \begin{aligned} J_{\beta}\left(\pi_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\beta}(s) V^{\pi}(s) \mathrm{d} s \\ &=\int_{\mathcal{S}} \int_{\mathcal{A}} \rho^{\beta}(s) \pi_{\theta}(a | s) Q^{\pi}(s, a) \mathrm{d} a \mathrm{d} s \end{aligned} $$
Differentiating the performance objective and applying an approximation gives the off-policy policy-gradient (Degris et al., 2012b)
$$ \begin{aligned} \nabla_{\theta} J_{\beta}\left(\pi_{\theta}\right) & \approx \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^{\beta}(s) \nabla_{\theta} \pi_{\theta}(a | s) Q^{\pi}(s, a) \mathrm{d} a \mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\beta}, a \sim \beta}\left[\frac{\pi_{\theta}(a | s)}{\beta(a | s)} \nabla_{\theta} \log \pi_{\theta}(a | s) Q^{\pi}(s, a)\right] \end{aligned} $$
This approximation drops the term that depends on the action-value gradient, $\pi_{\theta}(a|s) \nabla_{\theta} Q^{\pi}(s, a)$ (Degris et al., 2012b).
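As a minimal numerical sketch of this off-policy estimator (my own illustration, not from the paper), the snippet below assumes a discrete action space, a linear-softmax target policy $\pi_{\theta}$, a fixed behaviour distribution $\beta(\cdot|s)$, and a given critic $Q^{\pi}(s,\cdot)$; all names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def off_policy_pg_step(theta, phi, Q, beta_probs, rng, alpha=0.1):
    """One importance-sampled off-policy policy-gradient step (toy, linear-softmax policy).

    theta:      (n_actions, n_features) policy parameters
    phi:        (n_features,) features of the sampled state s
    Q:          (n_actions,) action values Q^pi(s, .), assumed given
    beta_probs: (n_actions,) behaviour policy beta(.|s) that generates the action
    """
    pi = softmax(theta @ phi)                 # target policy pi_theta(.|s)
    a = rng.choice(len(pi), p=beta_probs)     # action sampled from beta, not from pi
    # grad_theta log pi_theta(a|s) for a linear-softmax policy: row b is (1{b=a} - pi_b) * phi
    grad_log_pi = -np.outer(pi, phi)
    grad_log_pi[a] += phi
    rho = pi[a] / beta_probs[a]               # importance weight pi_theta(a|s) / beta(a|s)
    return theta + alpha * rho * Q[a] * grad_log_pi
```

Averaging many such steps over states drawn from $\rho^{\beta}$ approximates the expectation above; only the single per-step weight $\pi_{\theta}(a|s)/\beta(a|s)$ appears, not a product over a trajectory.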
For a deterministic policy, the update rule moves $\mu_{\theta}(s)$ in the direction of the gradient of $Q$:
$$ \theta^{k+1}=\theta^{k}+\alpha \mathbb{E}_{s \sim \rho^{\mu^{k}}} \left[\nabla_{\theta} Q^{\mu^{k}}\left(s, \mu_{\theta}(s)\right)\right] $$
Applying the chain rule:
$$ \theta^{k+1}=\theta^{k}+\alpha \mathbb{E}_{s \sim \rho^{\mu^{k}}} \left[\nabla_{\theta} \mu_{\theta}(s) \nabla_{a}Q^{\mu^{k}}\left(s, a\right) |_{a=\mu_{\theta}(s)} \right] $$
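To make the chain rule concrete, here is a toy sketch (mine, not the paper's) with a linear deterministic policy $\mu_{\theta}(s)=\Theta s$ and a known quadratic critic $Q(s,a)=-\|a-a^{*}(s)\|^{2}$, so both gradients are available in closed form:

```python
import numpy as np

def dpg_step(Theta, s, a_star, alpha=0.1):
    """One deterministic-policy-gradient ascent step on a toy problem.

    Policy:  mu_theta(s) = Theta @ s               (Theta has shape (dim_a, dim_s))
    Critic:  Q(s, a)     = -||a - a_star||^2, so grad_a Q = -2 (a - a_star)
    Update:  Theta <- Theta + alpha * grad_theta mu_theta(s) grad_a Q(s, a)|_{a = mu_theta(s)}
    """
    a = Theta @ s                       # a = mu_theta(s)
    grad_a_Q = -2.0 * (a - a_star)      # critic gradient w.r.t. the action, at a = mu_theta(s)
    # For mu_theta(s) = Theta @ s the chain rule gives dQ/dTheta[i, j] = grad_a_Q[i] * s[j]
    return Theta + alpha * np.outer(grad_a_Q, s)
```

Iterating this step over states $s \sim \rho^{\mu^{k}}$ pushes $\mu_{\theta}(s)$ toward the maximiser $a^{*}(s)$ of the critic.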
What method does it use?
- On-Policy Deterministic Actor-Critic
If the environment itself contains enough noise to provide exploration for the agent, this algorithm is workable. The critic is updated with Sarsa, using $Q^{w}(s,a)$ to approximate the true action-value function $Q^{\mu}$:
$$ \begin{aligned} \delta_{t} &=r_{t}+\gamma Q^{w}\left(s_{t+1}, a_{t+1}\right)-Q^{w}\left(s_{t}, a_{t}\right) \\ w_{t+1} &=w_{t}+\alpha_{w} \delta_{t} \nabla_{w} Q^{w}\left(s_{t}, a_{t}\right) \\ \theta_{t+1} &=\theta_{t}+\left.\alpha_{\theta} \nabla_{\theta} \mu_{\theta}\left(s_{t}\right) \nabla_{a} Q^{w}\left(s_{t}, a_{t}\right)\right|_{a=\mu_{\theta}(s)} \end{aligned} $$
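A minimal sketch of these three updates with a linear critic (my own code, not the paper's experimental setup); `phi_sa`, `mu`, `grad_theta_mu` and `grad_a_Qw` are hypothetical helpers supplied by the user:

```python
import numpy as np

def on_policy_dac_step(w, theta, s, a, r, s_next, a_next,
                       phi_sa, mu, grad_theta_mu, grad_a_Qw,
                       gamma=0.99, alpha_w=0.01, alpha_theta=0.001):
    """One on-policy deterministic actor-critic step with a Sarsa critic.

    Critic: Q^w(s, a) = w . phi_sa(s, a), linear in hand-crafted features.
    Actor:  theta moves along grad_theta mu_theta(s) @ grad_a Q^w(s, a)|_{a = mu_theta(s)},
            where grad_theta_mu(theta, s) returns the (dim_theta, dim_a) Jacobian.
    """
    # Sarsa TD error: bootstrap with the action a_next actually taken at s_next
    delta = r + gamma * np.dot(w, phi_sa(s_next, a_next)) - np.dot(w, phi_sa(s, a))
    w_new = w + alpha_w * delta * phi_sa(s, a)                          # critic update
    actor_grad = grad_theta_mu(theta, s) @ grad_a_Qw(w, s, mu(theta, s))
    theta_new = theta + alpha_theta * actor_grad                        # actor update
    return w_new, theta_new
```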
- Off-Policy Deterministic Actor-Critic
We modify the performance objective to be the value function of the target policy, averaged over the state distribution of the behaviour policy:
$$ \begin{aligned} J_{\beta}\left(\mu_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\beta}(s) V^{\mu}(s) \mathrm{d} s \\ &=\int_{\mathcal{S}} \rho^{\beta}(s) Q^{\mu}\left(s, \mu_{\theta}(s)\right) \mathrm{d} s \end{aligned} $$
$$ \begin{aligned} \nabla_{\theta} J_{\beta}\left(\mu_{\theta}\right) & \approx \int_{\mathcal{S}} \rho^{\beta}(s) \nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)} \mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\beta}} \left[\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\right] \end{aligned} $$
This yields the off-policy deterministic actor-critic (OPDAC) algorithm:
$$ \begin{aligned} \delta_{t} &=r_{t}+\gamma Q^{w}\left(s_{t+1}, \mu_{\theta}\left(s_{t+1}\right)\right)-Q^{w}\left(s_{t}, a_{t}\right) \\ w_{t+1} &=w_{t}+\alpha_{w} \delta_{t} \nabla_{w} Q^{w}\left(s_{t}, a_{t}\right) \\ \theta_{t+1} &=\theta_{t}+\left.\alpha_{\theta} \nabla_{\theta} \mu_{\theta}\left(s_{t}\right) \nabla_{a} Q^{w}\left(s_{t}, a_{t}\right)\right|_{a=\mu_{\theta}(s)} \end{aligned} $$
Unlike the stochastic off-policy algorithm, the policy here is deterministic, which removes the integral over actions, so no importance sampling is needed in the actor; the Q-learning-style critic update above likewise avoids importance sampling in the critic.
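For comparison with the on-policy sketch above, here is the corresponding OPDAC step under the same hypothetical linear-critic helpers (again my own sketch, not the paper's): the transition may come from any behaviour policy $\beta$, and the critic bootstraps with $\mu_{\theta}(s_{t+1})$ rather than the behaviour action.

```python
import numpy as np

def opdac_step(w, theta, s, a, r, s_next,
               phi_sa, mu, grad_theta_mu, grad_a_Qw,
               gamma=0.99, alpha_w=0.01, alpha_theta=0.001):
    """One off-policy deterministic actor-critic (OPDAC) step.

    (s, a, r, s_next) may be generated by an arbitrary behaviour policy beta;
    no importance weight appears because the actor only needs gradients at a = mu_theta(s).
    """
    # Q-learning-style TD error: bootstrap with the target policy's own action mu_theta(s_next)
    delta = (r + gamma * np.dot(w, phi_sa(s_next, mu(theta, s_next)))
             - np.dot(w, phi_sa(s, a)))
    w_new = w + alpha_w * delta * phi_sa(s, a)                          # critic update
    actor_grad = grad_theta_mu(theta, s) @ grad_a_Qw(w, s, mu(theta, s))
    theta_new = theta + alpha_theta * actor_grad                        # actor update
    return w_new, theta_new
```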
What results does it achieve?
Publication info? Author info?
This paper appeared at ICML 2014. The first author, David Silver, is a Research Scientist at Google DeepMind. He did his undergraduate and graduate studies at the University of Cambridge and his PhD at the University of Alberta in Canada, joined DeepMind in 2013, and is one of the creators and the project lead of AlphaGo.
References
- Reference: Degris, T., White, M., and Sutton, R. S. (2012b). Linear off-policy actor-critic. In 29th International Conference on Machine Learning.
Further reading
Suppose the true action-value function is $Q^{\pi}(s,a)$ and we approximate it with a function approximator $Q^{w}(s,a) \approx Q^{\pi}(s,a)$. If the function approximator is compatible, in the sense that 1. $Q^{w}(s, a)=\nabla_{\theta} \log \pi_{\theta}(a | s)^{\top} w$ (linear in these "features"), and 2. the parameters $w$ are chosen to minimise the mean-squared error $\varepsilon^{2}(w) = \mathbb{E}_{s \sim \rho^{\pi},a \sim \pi_{\theta}}[(Q^{w}(s,a)-Q^{\pi}(s,a))^{2}]$ (a linear regression problem in these features), then there is no bias (Sutton et al., 1999):
$$ \nabla_{\theta} J\left(\pi_{\theta}\right)=\mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a | s) Q^{w}(s, a)\right] $$
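As a quick worked example (mine, not the paper's): for a scalar Gaussian policy $\pi_{\theta}(a|s)=\mathcal{N}(a;\, \theta^{\top}\phi(s), \sigma^{2})$ with fixed $\sigma$, the compatible features and critic are
$$ \nabla_{\theta} \log \pi_{\theta}(a | s)=\frac{a-\theta^{\top} \phi(s)}{\sigma^{2}} \phi(s), \qquad Q^{w}(s, a)=\frac{a-\theta^{\top} \phi(s)}{\sigma^{2}} \phi(s)^{\top} w $$
so fitting $w$ by minimising $\varepsilon^{2}(w)$ is an ordinary least-squares problem in these features.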
Finally, the paper gives a compatible (linear) function-approximation theorem for DPG, together with the theoretical groundwork for its proofs.
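If I recall the paper correctly, the deterministic analogue of this compatibility condition takes the form
$$ Q^{w}(s, a)=\left(a-\mu_{\theta}(s)\right)^{\top} \nabla_{\theta} \mu_{\theta}(s)^{\top} w+V^{v}(s) $$
where $V^{v}(s)$ is any differentiable baseline that does not depend on the action, so that $\nabla_{a} Q^{w}(s, a)\big|_{a=\mu_{\theta}(s)}=\nabla_{\theta} \mu_{\theta}(s)^{\top} w$.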
- Reference: Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems 12, pages 1057–1063.
I should re-read this paper when I have time; some of the proofs in it deserve careful scrutiny.