In most reinforcement learning (RL) problems the environment's model is unknown, so dynamic programming cannot be applied directly. One option is to learn the MDP first; the post on policy iteration and value iteration in this series sketches that approach. But it runs into trouble during sampling: with purely random sampling, some states are very hard to reach, while sampling under a fixed policy makes the true transition probabilities hard to recover. Once the learned model is wrong, both value iteration and policy iteration break down. Hence model-free reinforcement learning, which interacts with the environment and learns directly from data, without ever building a model.
Model-free Reinforcement Learning
Model-free reinforcement learning first estimates values (of states or state-action pairs) from data, i.e., the expected cumulative reward. Given those estimates, it then performs model-free control, optimizing the current policy so as to maximize the value function.
How, then, do we do prediction for the value function in the model-free setting?
In model-free RL we cannot access the state transitions or the reward function; all we have are some episodes. Previously we used these episodes to learn a model; model-free methods use them to learn the value function or the policy directly, with no need to learn an MDP. This involves two key steps: 1. estimate the value function; 2. optimize the policy.
Value Function Estimate
In model-based RL (an MDP), the value function can be computed by dynamic programming:
$$ v_{\pi}(s)=\mathbb{E}_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_{t}=s] $$
In model-free methods the state transitions are unknown, so the expectation above cannot be computed.
Monte-Carlo Methods
Monte-Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. A classic example is estimating the area of a circle inscribed in a square: scatter points uniformly over the square, then compute
$$ \text{Circle Surface} = \text{Square Surface} \times \frac{\text{ points in circle}}{\text{points in total}} $$
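A minimal sketch of this estimator (function name and setup are my own):

```python
import random

def estimate_circle_area(n_points, radius=1.0, seed=0):
    """Estimate a circle's area by uniform sampling in its bounding square."""
    rng = random.Random(seed)
    in_circle = 0
    for _ in range(n_points):
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            in_circle += 1
    square_area = (2 * radius) ** 2
    # Circle area = square area * (points in circle / points in total)
    return square_area * in_circle / n_points

area = estimate_circle_area(200_000)  # converges toward pi * r^2 as n grows
```

With 200,000 points the estimate is typically within about 0.01 of the true area $\pi r^2$.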
Monte-Carlo Value Estimation
We have many episodes; from each one we compute the total discounted reward:
$$ G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots=\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} $$
The value function is the expected return, which Monte Carlo approximates by an empirical mean over sampled episodes:
$$ V^{\pi}(s) = \mathbb{E} [G_{t} \mid s_{t}=s,\pi] \approx \frac{1}{N} \sum_{i=1}^{N} G_{t}^{(i)} $$
The method boils down to two steps: 1. sample $N$ episodes starting from state $s$ under policy $\pi$; 2. compute the average of the cumulative rewards. This sampling-based approach computes the value in one shot, with no need to estimate state transitions or anything else about the MDP.
More concretely, the idea can be written as:
- Sample episodes under policy $\pi$.
- Every time-step $t$ that state $s$ is visited in an episode:
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_{t}$
  - Value is estimated by mean return $V(s) = S(s)/N(s)$
- By the law of large numbers, $V(s) \rightarrow V^{\pi}(s)$ as $N(s) \rightarrow \infty$.
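The steps above can be sketched on a toy environment. The random-walk task below is my own example (start in the middle of a short chain, step left or right uniformly, reward +1 for exiting on the right), whose true values are $V(i) = (i+1)/6$ for states $0..4$:

```python
import random
from collections import defaultdict

def run_episode(rng, n_states=5):
    """One episode of a toy random walk: start in the middle, step left/right
    uniformly, terminate off either end. Reward +1 on exiting to the right."""
    s = n_states // 2
    trajectory = []          # (state, reward received on leaving it)
    while 0 <= s < n_states:
        s_next = s + rng.choice([-1, 1])
        reward = 1.0 if s_next == n_states else 0.0
        trajectory.append((s, reward))
        s = s_next
    return trajectory

def mc_value_estimate(n_episodes=5000, gamma=1.0, seed=0):
    rng = random.Random(seed)
    N = defaultdict(int)     # visit counts N(s)
    S = defaultdict(float)   # total returns S(s)
    for _ in range(n_episodes):
        G = 0.0
        # Walk the episode backwards so G accumulates into G_t at each step.
        for s, r in reversed(run_episode(rng)):
            G = r + gamma * G
            N[s] += 1        # every-visit: count each occurrence of s
            S[s] += G
    return {s: S[s] / N[s] for s in N}

V = mc_value_estimate()      # V[2] should approach the true value 0.5
```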
Incremental Monte-Carlo Updates
- Update $V(s)$ incrementally after each episode
- For each state $S_{t}$ with cumulative return $G_{t}$
$$ \begin{array}{l} {N\left(S_{t}\right) \leftarrow N\left(S_{t}\right)+1} \\ {V\left(S_{t}\right) \leftarrow V\left(S_{t}\right)+\frac{1}{N\left(S_{t}\right)}\left(G_{t}-V\left(S_{t}\right)\right)} \end{array} $$
- For non-stationary problems (i.e. the environment could be varying over time), it can be useful to track a running mean, i.e. forget old episodes
If the environment's state transitions and reward function keep changing over time, the environment is called non-stationary (a stochastic environment whose distributions are fixed is still stationary). When the distributions do drift, old episodes need to be forgotten; a plain sample average weights old and new episodes equally and therefore never forgets. Using a constant step size instead does:
$$ V(S_{t}) \leftarrow V(S_{t}) + \alpha (G_{t} - V(S_{t})) $$
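A quick numerical illustration of the constant-step-size update (names are my own):

```python
def constant_alpha_update(v, g, alpha):
    """Running-mean update with a fixed step size: an exponential
    recency-weighted average that gradually forgets old returns."""
    return v + alpha * (g - v)

# After k further updates, a return's weight has decayed by (1 - alpha)^k,
# whereas the 1/N(s) sample average weights every episode equally forever.
v = 0.0
for g in [1.0, 1.0, 1.0]:
    v = constant_alpha_update(v, g, alpha=0.5)
# v is now 0.875: recent returns dominate the estimate.
```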
Some characteristics of Monte-Carlo value estimation:
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping (discussed later)
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs i.e., all episodes must terminate
Temporal-Difference Learning
TD methods replace the tail of the return with an estimate of the future value function:
$$ G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots \approx R_{t+1}+\gamma V(S_{t+1}) $$
$$ V(S_{t}) \leftarrow V(S_{t}) + \alpha(R_{t+1}+\gamma V(S_{t+1}) - V(S_{t})) $$
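This update rule can be sketched on the same kind of toy random walk used above (the environment is my own example; true values are $V(i) = (i+1)/6$):

```python
import random

def td0_value_estimate(n_episodes=5000, n_states=5, alpha=0.05, gamma=1.0, seed=0):
    """TD(0) prediction on a toy random walk: start in the middle, step
    left/right uniformly, reward +1 on exiting to the right."""
    rng = random.Random(seed)
    V = [0.0] * n_states                 # tabular value estimates
    for _ in range(n_episodes):
        s = n_states // 2
        while 0 <= s < n_states:
            s_next = s + rng.choice([-1, 1])
            r = 1.0 if s_next == n_states else 0.0
            v_next = V[s_next] if 0 <= s_next < n_states else 0.0
            # Update toward the TD target R + gamma * V(s') after every
            # single step, without waiting for the episode to finish.
            V[s] += alpha * (r + gamma * v_next - V[s])
            s = s_next
    return V

V_td = td0_value_estimate()  # V_td[2] should approach the true value 0.5
```

Unlike the Monte-Carlo version, each update happens immediately after a single transition.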
The TD algorithm has four main characteristics:
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
Monte Carlo vs. Temporal Difference
Monte Carlo and temporal-difference methods compare as follows:
- The same goal: learn $V_{\pi}$ from episodes of experience under policy $\pi$.
Incremental every-visit Monte-Carlo
- Update value $V(S_{t})$ toward actual return $G_{t}$.
$$ V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}-V(S_{t})) $$
Simplest temporal-difference learning algorithm: TD
- Update value $V(S_{t})$ toward estimated return $R_{t+1} + \gamma V(S_{t+1})$.
- TD target: $R_{t+1} + \gamma V(S_{t+1})$
- TD error: $\delta_{t} = R_{t+1} + \gamma V(S_{t+1}) - V(S_{t})$
Advantages and Disadvantages of MC vs. TD
TD can learn before knowing the final outcome
- TD can learn online after every step
- MC must wait until end of episode before return is known
TD can learn without the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
- Return $G_{t}$ is an unbiased estimate of $V^{\pi}(S_{t})$.
Sampling under the current policy and averaging the returns yields an unbiased estimate.
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $V^{\pi}$.
The TD target contains the estimated future value $V(S_{t+1})$. If that estimate were exact, the TD target would also be unbiased; but since $V(S_{t+1})$ is hard to estimate accurately, the target is biased.
- TD target is of much lower variance than the return
The variance of the TD target is generally lower than that of the return $G_{t}$: the return depends on many random actions, transitions, and rewards, while the TD target depends on a single random action, transition, and reward.
Advantages and Disadvantages of MC vs. TD (2)
- MC has high variance, zero bias
MC methods have good convergence properties (even with function approximation), are not very sensitive to initial values, and are very simple to understand and use; the price is that many samples are needed to drive the variance down.
- TD has low variance, some bias
TD methods are usually more efficient than MC. Tabular TD converges to $V^{\pi}(s)$, though not always with function approximation, and TD is more sensitive to initial values than MC.
n-step model-free prediction
(For time reasons, the n-step prediction section may be skipped in favor of going directly to model-free control.)
- Define the n-step return
$$ G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots +\gamma^{n-1}R_{t+n} + \gamma^{n} V(S_{t+n}) $$
- n-step temporal-difference learning
$$ V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}^{(n)} - V(S_{t})) $$
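A small sketch of computing the n-step return (the indexing convention and sample numbers are my own):

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """Compute G_t^(n): n discounted rewards plus gamma^n * V(S_{t+n}).
    rewards[k] holds R_{k+1}; values[k] holds the current estimate V(S_k).
    If the episode ends within n steps, this degrades to the full MC return."""
    G = 0.0
    for k in range(min(n, len(rewards) - t)):
        G += gamma ** k * rewards[t + k]
    if t + n < len(values):              # bootstrap only if S_{t+n} exists
        G += gamma ** n * values[t + n]
    return G

# n = 1 recovers the TD(0) target; as n grows the target approaches
# the Monte-Carlo return G_t.
rewards = [0.0, 0.0, 1.0]                # R_1, R_2, R_3
values = [0.1, 0.2, 0.4, 0.0]            # V(S_0)..V(S_3), S_3 terminal
g1 = n_step_return(rewards, values, t=0, n=1)   # 0.0 + 0.9 * 0.2  = 0.18
g3 = n_step_return(rewards, values, t=0, n=3)   # 0.81 * 1.0 + 0.0 = 0.81
```

So n interpolates between the low-variance, biased TD target and the unbiased, high-variance MC return.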
With the value function estimated, the next step is to improve the policy.
WeChat public account: 深度学习与先进智能决策 (ID: MultiAgent1024) — sharing deep learning, machine game playing, reinforcement learning, and related topics. Follows and discussion are welcome!