Reinforcement Learning with Code 【Chapter 10. Actor Critic】

This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Much of the material references Zhao Shiyu's *Mathematical Foundation of Reinforcement Learning*.
The code follows Mofan's reinforcement learning course.

10.1 The simplest actor-critic algorithm (QAC)

Recall that the idea of policy gradient methods is to search for an optimal policy by maximizing a scalar metric $J(\theta)$. The metric has three common choices: the average state value $\mathbb{E}[v_\pi(S)]$, the average one-step reward $\mathbb{E}[r_\pi(S)]$, or the state value of a specific starting state $s_0$.

According to the policy gradient theorem in Chapter 9, we have

$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha \nabla_\theta J(\theta_t) \\
& = \theta_t + \alpha\, \mathbb{E}_{S\sim\eta,\, A\sim\pi} [\nabla_\theta \ln \pi(A|S;\theta_t)\, q_\pi(S,A)]
\end{aligned}
$$

where $\eta$ is a distribution over the states. Since the true gradient is unknown, we can use a stochastic gradient to approximate it, hence we have

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\, q_t(s_t,a_t)$$

- In policy gradient methods such as REINFORCE, the true value $q_t(s_t,a_t)$ is approximated by a Monte-Carlo estimate, namely the episode return $q_t(s_t,a_t)=\sum_{k=t+1}^T \gamma^{k-t-1} r_k$.
- If $q_t(s_t,a_t)$ is estimated by value function approximation, and the value function is updated using TD learning, the corresponding algorithms are usually called actor-critic. Therefore, actor-critic methods can be seen as a kind of policy gradient method.

When we use a parameterized value function $q(s,a;w)$ to approximate $q_t(s_t,a_t)$, and the value function is updated by Sarsa-style TD learning, the algorithm is called Q actor-critic (QAC). The core idea of QAC is:

$$
\text{QAC:}
\left\{
\begin{aligned}
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\, q(s_t,a_t;w_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w \big[r_{t+1}+\gamma q(s_{t+1},a_{t+1};w_t) - q(s_t,a_t;w_t)\big]\nabla_w q(s_t,a_t;w_t)
\end{aligned}
\right.
$$

We use value function approximation to estimate the true q-value $q_t(s_t,a_t)$, and we update the value function with the idea of Sarsa.

We can also write the objective functions behind these update rules:

$$
\text{QAC:}
\left\{
\begin{aligned}
\textcolor{red}{\text{Actor}: \max_\theta J(\theta)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta,\, A\sim\pi}[\ln\pi(A|S;\theta)\, q(S,A;w)]} \\
\textcolor{red}{\text{Critic}: \min_w J(w)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta,\, A\sim\pi}\big[\big(R + \gamma q(S',A';w) - q(S,A;w)\big)^2\big]}
\end{aligned}
\right.
$$

Pseudocode

(Figure: pseudocode of the QAC algorithm.)
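To make these update rules concrete, below is a minimal PyTorch sketch of one QAC step. The network sizes, learning rates, and the Sarsa-tuple interface are illustrative assumptions, not part of the original pseudocode.

```python
import torch
import torch.nn as nn

# Illustrative sizes for a small discrete-action task (assumptions).
STATE_DIM, ACTION_DIM, GAMMA = 4, 2, 0.99

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Softmax(dim=-1))  # π(a|s; θ)
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, ACTION_DIM))                     # q(s, ·; w)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-2)

def qac_step(s, a, r, s_next, a_next, done):
    """One QAC update from a Sarsa tuple (s, a, r, s', a')."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    # Critic: move q(s, a; w) toward the Sarsa target r + γ q(s', a'; w).
    with torch.no_grad():
        target = r + GAMMA * critic(s_next)[a_next] * (1.0 - done)
    critic_loss = (target - critic(s)[a]).pow(2)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: ascend ∇_θ ln π(a|s; θ) q(s, a; w), i.e. descend its negative.
    log_pi = torch.log(actor(s)[a])
    actor_loss = -log_pi * critic(s)[a].detach()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```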

10.2 Advantage Actor-Critic (A2C)

The core idea of A2C is to introduce a baseline to reduce estimation variance. That is,
$$\mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)\, q_\pi(S,A)] = \mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)\,(q_\pi(S,A)-b(S))]$$
where the additional baseline $b(S)$ is a scalar function of $S$. Adding a baseline does not affect the expectation, because
$$
\begin{aligned}
\mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)\, b(S)]
& = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\pi(a|s;\theta_t) \nabla_\theta \ln \pi(a|s;\theta_t)\, b(s) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t)\, b(s) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\, b(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\, b(s)\, \nabla_\theta\sum_{a\in\mathcal{A}}\pi(a|s;\theta_t) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\, b(s)\, \nabla_\theta 1 = 0
\end{aligned}
$$
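This identity can also be checked numerically. The following sketch (my own illustration, not from the original note) builds a random softmax policy over one state, computes $\sum_a \pi(a)\nabla_\theta \ln\pi(a)\, b$ exactly with autograd, and confirms the gradient is zero for an arbitrary baseline value.

```python
import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)  # logits of a softmax policy over 5 actions
b = 3.7                                     # an arbitrary baseline value b(s)

pi = torch.softmax(theta, dim=0)
# E_{A~π}[∇_θ ln π(A;θ) b] = Σ_a π(a) ∇_θ ln π(a;θ) b;
# detach π so only the ln π factor is differentiated.
expectation = (pi.detach() * torch.log(pi) * b).sum()
grad = torch.autograd.grad(expectation, theta)[0]
print(grad)  # ~0 up to floating-point error, regardless of b
```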

How do we find the optimal baseline? The derivation is omitted here. The variance-minimizing baseline is

$$b^*(s) = \frac{\mathbb{E}_{A\sim\pi}[\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2\, q_\pi(s,A)]}{\mathbb{E}_{A\sim\pi}[\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2]}$$

But it is too complex to use in practice. If the weight $\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2$ is removed, we obtain a suboptimal baseline with a concise expression:

$$\textcolor{red}{b^\dagger (s) = \mathbb{E}_{A\sim\pi}[q_\pi(s,A)] = v_\pi(s)}$$

This suboptimal baseline is simply the state value of state $s$.

When $b(s)=v_\pi(s)$, the gradient-ascent update becomes
$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha\,\mathbb{E}\big[\nabla_\theta \ln\pi(A|S;\theta_t)\,[q_\pi(S,A)-v_\pi(S)]\big] \\
& = \theta_t + \alpha\,\mathbb{E}\big[\nabla_\theta\ln\pi(A|S;\theta_t)\, \delta_\pi(S,A)\big]
\end{aligned}
$$
Here,
$$\textcolor{red}{\delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S)}$$
is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\,q_\pi(s,a)$ is the mean of the action values. If $\delta_\pi(s,a)>0$, the corresponding action has a greater value than the mean.
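A tiny numerical illustration (the numbers are invented):

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])  # π(a|s) over three actions
q = np.array([1.0, 2.0, 4.0])   # q_π(s, a)
v = pi @ q                      # v_π(s) = Σ_a π(a|s) q_π(s,a) = 2.4
adv = q - v                     # advantage δ_π(s, a)
print(v, adv)                   # 2.4 [-1.4 -0.4  1.6]: only the third action beats the mean
```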

The stochastic version is
$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha\nabla_\theta \ln\pi(a_t|s_t;\theta_t)\,[q_t(s_t,a_t)-v_t(s_t)] \\
& = \theta_t + \alpha\nabla_\theta\ln\pi(a_t|s_t;\theta_t)\, \delta_t(s_t,a_t)
\end{aligned}
$$
We need to estimate $q_t(s_t,a_t)$ and $v_t(s_t)$. There are several ways:

- If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by Monte-Carlo learning, the algorithm is called REINFORCE with a baseline.
- If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C). In that case the advantage is further approximated by the TD error:

$$
\begin{aligned}
q_t(s_t,a_t) - v_t(s_t) & = r_{t+1} +\gamma q_t(s_{t+1},a_{t+1}) - v_t(s_t) \\
& \textcolor{red}{\approx r_{t+1} +\gamma v_t(s_{t+1}) - v_t(s_t)}
\end{aligned}
$$

Hence, we don't need to maintain two networks to represent $v_\pi(s)$ and $q_\pi(s,a)$; a single network representing $v_\pi(s)$ suffices.

In A2C we use one policy network $\pi(a|s;\theta)$ and one state-value network $v(s;w)$. The core idea of A2C is:

$$
\text{A2C}:
\left\{
\begin{aligned}
\text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta\, \textcolor{blue}{\delta_t}\, \nabla_\theta \ln\pi(a_t|s_t;\theta_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w\, \textcolor{blue}{\delta_t}\, \nabla_w v(s_t;w_t)
\end{aligned}
\right.
$$

We can write the objective functions behind these update rules:

$$
\text{A2C}:
\left\{
\begin{aligned}
\text{Advantage}: \Delta(S) & = R+\gamma v(S';w) - v(S;w) \\
\textcolor{red}{\text{Actor}: \max_\theta J(\theta)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta,\, A\sim\pi}[\ln\pi(A|S;\theta)\,\Delta(S)]} \\
\textcolor{red}{\text{Critic}: \min_w J(w)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta}\big[\big(R + \gamma v(S';w) - v(S;w)\big)^2\big] = \mathbb{E}_{S\sim\eta}[\Delta(S)^2]}
\end{aligned}
\right.
$$

Pseudocode

(Figure: pseudocode of the A2C algorithm.)
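As with QAC, here is a minimal PyTorch sketch of one A2C step; the sizes and learning rates are illustrative assumptions. Note how a single TD error $\delta$ drives both networks.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 4, 2, 0.99  # illustrative assumptions

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Softmax(dim=-1))  # π(a|s; θ)
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))                              # v(s; w)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-2)

def a2c_step(s, a, r, s_next, done):
    """One A2C update from a transition (s, a, r, s')."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    # TD error δ = r + γ v(s'; w) − v(s; w): regression error for the critic
    # and advantage estimate for the actor.
    v_s = critic(s)
    with torch.no_grad():
        td_target = r + GAMMA * critic(s_next) * (1.0 - done)
    delta = td_target - v_s
    critic_loss = delta.pow(2).sum()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: ascend δ ∇_θ ln π(a|s; θ), with δ treated as a constant.
    log_pi = torch.log(actor(s)[a])
    actor_loss = -(delta.detach() * log_pi).sum()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```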

10.3 Off-policy Actor-Critic

Importance Sampling

The key technique for converting the AC algorithm to off-policy is importance sampling. Consider a random variable $X\in\mathcal{X}$, and suppose $p_0(X)$ is a probability distribution. Our goal is to estimate $\mathbb{E}_{X\sim p_0}[X]$. Now suppose $p_1(X)$ is another probability distribution of $X$: how can we use samples drawn from $p_1(X)$ to estimate $\mathbb{E}_{X\sim p_0}[X]$? The technique is importance sampling. Given i.i.d. samples $\{x_i\}^n_{i=1}$ generated by $p_1(X)$, we have
$$\mathbb{E}_{X\sim p_0}[X] = \sum_{x\in\mathcal{X}}p_0(x)\,x = \sum_{x\in\mathcal{X}}p_1(x)\underbrace{\frac{p_0(x)}{p_1(x)}x}_{f(x)} = \mathbb{E}_{X\sim p_1}[f(X)]$$
$$\mathbb{E}_{X\sim p_0}[X] = \mathbb{E}_{X\sim p_1}[f(X)] \approx \bar{f} = \frac{1}{n} \sum^n_{i=1}f(x_i) = \frac{1}{n} \sum^n_{i=1} \underbrace{\frac{p_0(x_i)}{p_1(x_i)}}_{\text{importance weight}}x_i$$
An Example

Consider $X\in\mathcal{X}=\{+1,-1\}$. Suppose $p_0$ is a probability distribution satisfying
$$p_0(X=+1)=0.5, \quad p_0(X=-1)=0.5$$
The expectation of $X$ over $p_0$ is
$$\mathbb{E}_{X\sim p_0}[X] = (+1)\times 0.5 + (-1) \times 0.5 = 0$$
Suppose $p_1$ is a probability distribution satisfying
$$p_1(X=+1)=0.8, \quad p_1(X=-1)=0.2$$
The expectation of $X$ over $p_1$ is
$$\mathbb{E}_{X\sim p_1}[X] = (+1)\times 0.8 + (-1) \times 0.2 = 0.6$$
We can use the importance sampling technique to sample data under distribution $p_1$ and still estimate $\mathbb{E}_{X\sim p_0}[X]$:
$$\mathbb{E}_{X\sim p_0}[X] \approx \frac{1}{n}\sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)}x_i$$
The following simulation demonstrates this:

```python
import numpy as np
import matplotlib.pyplot as plt

# reproducible
np.random.seed(0)

# elements and their probabilities under p0 and p1
elements = [1, -1]
probs1 = [0.5, 0.5]  # p0
probs2 = [0.8, 0.2]  # p1

# importance sampling: draw from p1, estimate the mean under p0
sample_times = 300
sample_list = []       # raw samples drawn from p1
i_sample_list = []     # samples reweighted by p0(x)/p1(x)
average_list = []      # running plain average -> E_{X~p1}[X] = 0.6
importance_list = []   # running importance-weighted average -> E_{X~p0}[X] = 0
for i in range(sample_times):
    sample = np.random.choice(elements, p=probs2)
    sample_list.append(sample)
    average_list.append(np.mean(sample_list))
    if sample == elements[0]:
        i_sample_list.append(probs1[0] / probs2[0] * sample)
    elif sample == elements[1]:
        i_sample_list.append(probs1[1] / probs2[1] * sample)
    importance_list.append(np.mean(i_sample_list))

plt.plot(range(len(sample_list)), sample_list, 'o', markerfacecolor='none', label='sample data')
plt.plot(range(len(average_list)), average_list, 'b--', label='average')
plt.plot(range(len(importance_list)), importance_list, 'g--', label='importance sampling')
plt.axhline(y=0.6, color='r', linestyle='--')
plt.axhline(y=0, color='r', linestyle='--')
plt.ylim(-1.5, 2.5)        # limit the visible y range
plt.xlim(0, sample_times)  # limit the visible x range
plt.legend(loc='upper right')
plt.show()
```

(Figure: output of the script above. The plain running average converges to $\mathbb{E}_{X\sim p_1}[X]=0.6$, while the importance-weighted average converges to $\mathbb{E}_{X\sim p_0}[X]=0$.)

Off-policy policy gradient theorem

With importance sampling, we are ready to present the off-policy policy gradient theorem. Suppose $\beta$ is a behavior policy. Our goal is to use the samples generated by the behavior policy $\beta$ to learn a target policy $\pi$ that maximizes the metric
$$\max_\theta J(\theta) = \mathbb{E}_{S\sim d_\beta}[v_\pi(S)]$$
Theorem 10.1 (Stochastic off-policy policy gradient theorem). In the discounted case where $\gamma\in(0,1)$, the gradient of $J(\theta)$ is
$$\textcolor{red}{\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\underbrace{\frac{\pi(A|S;\theta)}{\beta(A|S)}}_{\text{importance weight}} \nabla_\theta \ln \pi(A|S;\theta)\, q_\pi(S,A) \Big]}$$
where the state distribution $\rho$ is
$$\rho(s) \triangleq \sum_{s'\in\mathcal{S}} d_\beta(s') \Pr\nolimits_\pi(s|s')$$
and $\Pr_\pi(s|s')=\sum_{k=0}^\infty \gamma^k[P^k_\pi]_{s',s}=[(I-\gamma P_\pi)^{-1}]_{s',s}$ is the discounted total probability of transitioning from $s'$ to $s$ under policy $\pi$.

The off-policy policy gradient is also invariant to an additional baseline $b(s)$. In particular, we have
$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\frac{\pi(A|S;\theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S;\theta) \big( q_\pi(S,A) - b(S) \big) \Big]$$
When we take the state value as the baseline, $b(S)=v_\pi(S)$, we again obtain the advantage function
$$\delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S)$$
The corresponding stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)} \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\,\big(q_t(s_t,a_t)-v_t(s_t)\big)$$
The advantage function can be replaced by the TD error:
$$q_t(s_t,a_t)-v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \triangleq \delta_t(s_t,a_t)$$
In off-policy A2C we use the behavior policy $\beta$ to collect samples and learn a policy network $\pi(a|s;\theta)$ and a value network $v(s;w)$. The core idea of off-policy A2C is:

$$
\text{Off-policy A2C}:
\left\{
\begin{aligned}
\text{Behavior policy}: a_t & \sim \beta(\cdot|s_t) \\
\text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)}\, \delta_t\,\nabla_\theta \ln\pi(a_t|s_t;\theta_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)}\,\delta_t\, \nabla_w v(s_t;w_t)
\end{aligned}
\right.
$$

We can again write the objective functions behind these update rules:

$$
\text{Off-policy A2C}:
\left\{
\begin{aligned}
\text{Behavior policy}: A & \sim \beta \\
\text{Advantage}: \Delta(S) & = R + \gamma v(S';w) - v(S;w) \\
\text{Actor}: \max_\theta J(\theta) & = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\frac{\pi(A|S;\theta)}{\beta(A|S)}\,\Delta(S) \ln\pi(A|S;\theta)\Big] \\
\text{Critic}: \min_w J(w) & = \mathbb{E}_{S\sim\rho}\big[\big(R + \gamma v(S';w) - v(S;w)\big)^2\big] = \mathbb{E}_{S\sim\rho}[\Delta(S)^2]
\end{aligned}
\right.
$$

Pseudocode

(Figure: pseudocode of the off-policy A2C algorithm.)
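A short PyTorch sketch of the importance-weighted losses for one logged transition follows. The function signature, and in particular the `beta_prob` argument (the behavior policy's probability $\beta(a|s)$ of the logged action), are illustrative assumptions.

```python
import torch

def off_policy_a2c_losses(actor, critic, s, a, r, s_next, beta_prob, gamma=0.99):
    """Importance-weighted A2C losses for one transition collected under β."""
    v_s = critic(s)
    v_next = critic(s_next).detach()
    delta = (r + gamma * v_next - v_s).detach()  # TD error used as the advantage
    rho = (actor(s)[a] / beta_prob).detach()     # importance weight π(a|s;θ)/β(a|s)
    actor_loss = -rho * delta * torch.log(actor(s)[a])
    critic_loss = rho * (r + gamma * v_next - v_s).pow(2)
    return actor_loss, critic_loss
```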

10.4 Deterministic Actor-Critic

A deterministic policy means that for any state, a single action is given probability one and all other actions probability zero. The deterministic case is important to study because it is naturally off-policy and can effectively handle continuous action spaces. We usually write
$$a = \mu(s;\theta)$$
to denote a deterministic policy; $\mu$ directly outputs the action, since it is a mapping from $\mathcal{S}$ to $\mathcal{A}$. For the sake of simplicity, we often abbreviate $\mu(s;\theta)$ as $\mu(s)$.

Theorem 10.2 (Deterministic policy gradient theorem). The gradient of $J(\theta)$ is
$$
\begin{aligned}
\nabla_\theta J(\theta) & = \sum_{s\in\mathcal{S}}\eta(s)\, \nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} \\
& = \mathbb{E}_{S\sim\eta} \Big[ \nabla_\theta \mu(S)\,\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)} \Big]
\end{aligned}
$$
where $\eta$ is a distribution of the states.

The gradient in the deterministic case does not involve the action random variable $A$. As a result, when we use samples to approximate the true gradient, the actions are not required to be sampled from the target policy. Therefore, the deterministic policy gradient method is naturally off-policy.

Pseudocode

(Figure: pseudocode of the deterministic actor-critic algorithm.)
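Here is a minimal PyTorch sketch of one deterministic actor-critic update in the spirit of this theorem (DDPG-style, but without target networks or a replay buffer; the sizes and learning rates are illustrative assumptions). Autograd chains $\nabla_\theta\mu(s)$ with $\nabla_a q_\mu(s,a)|_{a=\mu(s)}$ automatically when we backpropagate through $q(s,\mu(s))$.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 3, 1, 0.99  # illustrative assumptions

mu = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                   nn.Linear(64, ACTION_DIM), nn.Tanh())          # a = μ(s; θ)
q = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                  nn.Linear(64, 1))                               # q(s, a; w)
mu_opt = torch.optim.Adam(mu.parameters(), lr=1e-3)
q_opt = torch.optim.Adam(q.parameters(), lr=1e-2)

def dpg_step(s, a, r, s_next):
    """One update; (s, a, r, s') may come from any behavior policy,
    e.g. μ plus exploration noise, since the method is off-policy."""
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    # Critic: the TD target uses the deterministic action μ(s') at the next state.
    with torch.no_grad():
        target = r + GAMMA * q(torch.cat([s_next, mu(s_next)]))
    q_loss = (target - q(torch.cat([s, a]))).pow(2).sum()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # Actor: maximize q(s, μ(s)); only the actor optimizer steps here.
    mu_loss = -q(torch.cat([s, mu(s)])).sum()
    mu_opt.zero_grad(); mu_loss.backward(); mu_opt.step()
```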

Reference

Zhao Shiyu's course, *Mathematical Foundation of Reinforcement Learning*.
