Reinforcement Learning with Code 【Chapter 10. Actor Critic】

This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Much of the material references Zhao Shiyu's *Mathematical Foundation of Reinforcement Learning*.
The code follows Mofan's reinforcement learning course.

10.1 The simplest actor-critic algorithm (QAC)

Recall that the idea of policy gradient methods is to search for an optimal policy by maximizing a scalar metric $J(\theta)$. The metric has three common choices: the average state value $\mathbb{E}[v_\pi(S)]$, the average one-step reward $\mathbb{E}[r_\pi(S)]$, or the state value of a specific starting state $s_0$.

According to the policy gradient theorem in Chapter 9, we have

$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha \nabla_\theta J(\theta_t) \\
& = \theta_t + \alpha\, \mathbb{E}_{S\sim\eta,\, A\sim\pi} [\nabla_\theta \ln \pi(A|S;\theta_t)\, q_\pi(S,A)]
\end{aligned}
$$

where $\eta$ is a distribution over the states. Since the true gradient is unknown, we can use a stochastic gradient to approximate it, hence we have

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\, q_t(s_t,a_t)$$

- In policy gradient methods such as REINFORCE, the true value $q_t(s_t,a_t)$ is approximated by a Monte-Carlo estimate, namely the episode return $q_t(s_t,a_t)=\sum_{k=t+1}^T \gamma^{k-t-1} r_k$.
- If $q_t(s_t,a_t)$ is estimated by value function approximation, and the value function is updated using TD learning, the corresponding algorithms are usually called actor-critic. Therefore, actor-critic methods can be seen as a kind of policy gradient method.

When we use a parameterized value function $q(s,a;w)$ to approximate $q_t(s_t,a_t)$, and the value function is updated by Sarsa-style TD learning, the algorithm is called Q actor-critic (QAC). The core idea of QAC is:

$$
\text{QAC:}
\left\{
\begin{aligned}
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\, q(s_t,a_t;w_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w \big[r_{t+1}+\gamma q(s_{t+1},a_{t+1};w_t) - q(s_t,a_t;w_t)\big]\nabla_w q(s_t,a_t;w_t)
\end{aligned}
\right.
$$

We use value function approximation to estimate the true q-value $q_t(s_t,a_t)$, and we update the value function with the idea of Sarsa.

We can also write the objective functions behind these update rules:

$$
\text{QAC:}
\left\{
\begin{aligned}
\textcolor{red}{\text{Actor}: \max_\theta J(\theta)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta,\, A\sim\pi}[\ln\pi(A|S;\theta)\, q(S,A;w)]} \\
\textcolor{red}{\text{Critic}: \min_w J(w)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta,\, A\sim\pi}\big[\big(R + \gamma q(S',A';w) - q(S,A;w)\big)^2\big]}
\end{aligned}
\right.
$$

Pseudocode

(Figure: pseudocode of the QAC algorithm.)
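To make these update rules concrete, below is a minimal PyTorch sketch of one QAC step. The network sizes, learning rates, and the Sarsa-tuple interface are illustrative assumptions, not part of the original pseudocode.

```python
import torch
import torch.nn as nn

# Illustrative sizes for a small discrete-action task (assumptions).
STATE_DIM, ACTION_DIM, GAMMA = 4, 2, 0.99

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Softmax(dim=-1))  # π(a|s; θ)
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, ACTION_DIM))                     # q(s, ·; w)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-2)

def qac_step(s, a, r, s_next, a_next, done):
    """One QAC update from a Sarsa tuple (s, a, r, s', a')."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    # Critic: move q(s, a; w) toward the Sarsa target r + γ q(s', a'; w).
    with torch.no_grad():
        target = r + GAMMA * critic(s_next)[a_next] * (1.0 - done)
    critic_loss = (target - critic(s)[a]).pow(2)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: ascend ∇_θ ln π(a|s; θ) q(s, a; w), i.e. descend its negative.
    log_pi = torch.log(actor(s)[a])
    actor_loss = -log_pi * critic(s)[a].detach()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```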

10.2 Advantage Actor-Critic (A2C)

The core idea of A2C is to introduce a baseline to reduce estimation variance. That is,
$$\mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)\, q_\pi(S,A)] = \mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)\,(q_\pi(S,A)-b(S))]$$
where the additional baseline $b(S)$ is a scalar function of $S$. Adding a baseline does not affect the expectation, because
$$
\begin{aligned}
\mathbb{E}_{S\sim\eta,A\sim\pi}[\nabla_\theta \ln \pi(A|S;\theta_t)\, b(S)]
& = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\pi(a|s;\theta_t) \nabla_\theta \ln \pi(a|s;\theta_t)\, b(s) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t)\, b(s) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\, b(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\, b(s)\, \nabla_\theta\sum_{a\in\mathcal{A}}\pi(a|s;\theta_t) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\, b(s)\, \nabla_\theta 1 = 0
\end{aligned}
$$
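This identity can also be checked numerically. The following sketch (my own illustration, not from the original note) builds a random softmax policy over one state, computes $\sum_a \pi(a)\nabla_\theta \ln\pi(a)\, b$ exactly with autograd, and confirms the gradient is zero for an arbitrary baseline value.

```python
import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)  # logits of a softmax policy over 5 actions
b = 3.7                                     # an arbitrary baseline value b(s)

pi = torch.softmax(theta, dim=0)
# E_{A~π}[∇_θ ln π(A;θ) b] = Σ_a π(a) ∇_θ ln π(a;θ) b;
# detach π so only the ln π factor is differentiated.
expectation = (pi.detach() * torch.log(pi) * b).sum()
grad = torch.autograd.grad(expectation, theta)[0]
print(grad)  # ~0 up to floating-point error, regardless of b
```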

How do we find the optimal baseline? The derivation is omitted here. The variance-minimizing baseline is

$$b^*(s) = \frac{\mathbb{E}_{A\sim\pi}[\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2\, q_\pi(s,A)]}{\mathbb{E}_{A\sim\pi}[\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2]}$$

But it is too complex to use in practice. If the weight $\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2$ is removed, we obtain a suboptimal baseline with a concise expression:

$$\textcolor{red}{b^\dagger (s) = \mathbb{E}_{A\sim\pi}[q_\pi(s,A)] = v_\pi(s)}$$

This suboptimal baseline is simply the state value of state $s$.

When $b(s)=v_\pi(s)$, the gradient-ascent update becomes
$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha\,\mathbb{E}\big[\nabla_\theta \ln\pi(A|S;\theta_t)\,[q_\pi(S,A)-v_\pi(S)]\big] \\
& = \theta_t + \alpha\,\mathbb{E}\big[\nabla_\theta\ln\pi(A|S;\theta_t)\, \delta_\pi(S,A)\big]
\end{aligned}
$$
Here,
$$\textcolor{red}{\delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S)}$$
is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\,q_\pi(s,a)$ is the mean of the action values. If $\delta_\pi(s,a)>0$, the corresponding action has a greater value than the mean.
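A tiny numerical illustration (the numbers are invented):

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])  # π(a|s) over three actions
q = np.array([1.0, 2.0, 4.0])   # q_π(s, a)
v = pi @ q                      # v_π(s) = Σ_a π(a|s) q_π(s,a) = 2.4
adv = q - v                     # advantage δ_π(s, a)
print(v, adv)                   # 2.4 [-1.4 -0.4  1.6]: only the third action beats the mean
```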

The stochastic version is
$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha\nabla_\theta \ln\pi(a_t|s_t;\theta_t)\,[q_t(s_t,a_t)-v_t(s_t)] \\
& = \theta_t + \alpha\nabla_\theta\ln\pi(a_t|s_t;\theta_t)\, \delta_t(s_t,a_t)
\end{aligned}
$$
We need to estimate $q_t(s_t,a_t)$ and $v_t(s_t)$. There are several ways:

- If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by Monte-Carlo learning, the algorithm is called REINFORCE with a baseline.
- If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C). In that case the advantage is further approximated by the TD error:

$$
\begin{aligned}
q_t(s_t,a_t) - v_t(s_t) & = r_{t+1} +\gamma q_t(s_{t+1},a_{t+1}) - v_t(s_t) \\
& \textcolor{red}{\approx r_{t+1} +\gamma v_t(s_{t+1}) - v_t(s_t)}
\end{aligned}
$$

Hence, we don't need to maintain two networks to represent $v_\pi(s)$ and $q_\pi(s,a)$; a single network representing $v_\pi(s)$ suffices.

In A2C we use one policy network $\pi(a|s;\theta)$ and one state-value network $v(s;w)$. The core idea of A2C is:

$$
\text{A2C}:
\left\{
\begin{aligned}
\text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta\, \textcolor{blue}{\delta_t}\, \nabla_\theta \ln\pi(a_t|s_t;\theta_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w\, \textcolor{blue}{\delta_t}\, \nabla_w v(s_t;w_t)
\end{aligned}
\right.
$$

We can write the objective functions behind these update rules:

$$
\text{A2C}:
\left\{
\begin{aligned}
\text{Advantage}: \Delta(S) & = R+\gamma v(S';w) - v(S;w) \\
\textcolor{red}{\text{Actor}: \max_\theta J(\theta)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta,\, A\sim\pi}[\ln\pi(A|S;\theta)\,\Delta(S)]} \\
\textcolor{red}{\text{Critic}: \min_w J(w)} & \textcolor{red}{= \mathbb{E}_{S\sim\eta}\big[\big(R + \gamma v(S';w) - v(S;w)\big)^2\big] = \mathbb{E}_{S\sim\eta}[\Delta(S)^2]}
\end{aligned}
\right.
$$

Pseudocode

(Figure: pseudocode of the A2C algorithm.)
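As with QAC, here is a minimal PyTorch sketch of one A2C step; the sizes and learning rates are illustrative assumptions. Note how a single TD error $\delta$ drives both networks.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 4, 2, 0.99  # illustrative assumptions

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Softmax(dim=-1))  # π(a|s; θ)
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))                              # v(s; w)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-2)

def a2c_step(s, a, r, s_next, done):
    """One A2C update from a transition (s, a, r, s')."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    # TD error δ = r + γ v(s'; w) − v(s; w): regression error for the critic
    # and advantage estimate for the actor.
    v_s = critic(s)
    with torch.no_grad():
        td_target = r + GAMMA * critic(s_next) * (1.0 - done)
    delta = td_target - v_s
    critic_loss = delta.pow(2).sum()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: ascend δ ∇_θ ln π(a|s; θ), with δ treated as a constant.
    log_pi = torch.log(actor(s)[a])
    actor_loss = -(delta.detach() * log_pi).sum()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```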

10.3 Off-policy Actor-Critic

Importance Sampling

The key technique for converting the AC algorithm to off-policy is importance sampling. Consider a random variable $X\in\mathcal{X}$, and suppose $p_0(X)$ is a probability distribution. Our goal is to estimate $\mathbb{E}_{X\sim p_0}[X]$. Now suppose $p_1(X)$ is another probability distribution of $X$: how can we use samples drawn from $p_1(X)$ to estimate $\mathbb{E}_{X\sim p_0}[X]$? The technique is importance sampling. Given i.i.d. samples $\{x_i\}^n_{i=1}$ generated by $p_1(X)$, we have
$$\mathbb{E}_{X\sim p_0}[X] = \sum_{x\in\mathcal{X}}p_0(x)\,x = \sum_{x\in\mathcal{X}}p_1(x)\underbrace{\frac{p_0(x)}{p_1(x)}x}_{f(x)} = \mathbb{E}_{X\sim p_1}[f(X)]$$
$$\mathbb{E}_{X\sim p_0}[X] = \mathbb{E}_{X\sim p_1}[f(X)] \approx \bar{f} = \frac{1}{n} \sum^n_{i=1}f(x_i) = \frac{1}{n} \sum^n_{i=1} \underbrace{\frac{p_0(x_i)}{p_1(x_i)}}_{\text{importance weight}}x_i$$
An Example

Consider $X\in\mathcal{X}=\{+1,-1\}$. Suppose $p_0$ is a probability distribution satisfying
$$p_0(X=+1)=0.5, \quad p_0(X=-1)=0.5$$
The expectation of $X$ over $p_0$ is
$$\mathbb{E}_{X\sim p_0}[X] = (+1)\times 0.5 + (-1) \times 0.5 = 0$$
Suppose $p_1$ is a probability distribution satisfying
$$p_1(X=+1)=0.8, \quad p_1(X=-1)=0.2$$
The expectation of $X$ over $p_1$ is
$$\mathbb{E}_{X\sim p_1}[X] = (+1)\times 0.8 + (-1) \times 0.2 = 0.6$$
We can use the importance sampling technique to sample data under distribution $p_1$ and still estimate $\mathbb{E}_{X\sim p_0}[X]$:
$$\mathbb{E}_{X\sim p_0}[X] \approx \frac{1}{n}\sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)}x_i$$
The following simulation demonstrates this:

```python
import numpy as np
import matplotlib.pyplot as plt

# reproducible
np.random.seed(0)

# elements and their probabilities under p0 and p1
elements = [1, -1]
probs1 = [0.5, 0.5]  # p0
probs2 = [0.8, 0.2]  # p1

# importance sampling: draw from p1, estimate the mean under p0
sample_times = 300
sample_list = []       # raw samples drawn from p1
i_sample_list = []     # samples reweighted by p0(x)/p1(x)
average_list = []      # running plain average -> E_{X~p1}[X] = 0.6
importance_list = []   # running importance-weighted average -> E_{X~p0}[X] = 0
for i in range(sample_times):
    sample = np.random.choice(elements, p=probs2)
    sample_list.append(sample)
    average_list.append(np.mean(sample_list))
    if sample == elements[0]:
        i_sample_list.append(probs1[0] / probs2[0] * sample)
    elif sample == elements[1]:
        i_sample_list.append(probs1[1] / probs2[1] * sample)
    importance_list.append(np.mean(i_sample_list))

plt.plot(range(len(sample_list)), sample_list, 'o', markerfacecolor='none', label='sample data')
plt.plot(range(len(average_list)), average_list, 'b--', label='average')
plt.plot(range(len(importance_list)), importance_list, 'g--', label='importance sampling')
plt.axhline(y=0.6, color='r', linestyle='--')
plt.axhline(y=0, color='r', linestyle='--')
plt.ylim(-1.5, 2.5)        # limit the visible y range
plt.xlim(0, sample_times)  # limit the visible x range
plt.legend(loc='upper right')
plt.show()
```

(Figure: output of the script above. The plain running average converges to $\mathbb{E}_{X\sim p_1}[X]=0.6$, while the importance-weighted average converges to $\mathbb{E}_{X\sim p_0}[X]=0$.)

Off-policy policy gradient theorem

With importance sampling, we are ready to present the off-policy policy gradient theorem. Suppose $\beta$ is a behavior policy. Our goal is to use the samples generated by the behavior policy $\beta$ to learn a target policy $\pi$ that maximizes the metric
$$\max_\theta J(\theta) = \mathbb{E}_{S\sim d_\beta}[v_\pi(S)]$$
Theorem 10.1 (Stochastic off-policy policy gradient theorem). In the discounted case where $\gamma\in(0,1)$, the gradient of $J(\theta)$ is
$$\textcolor{red}{\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\underbrace{\frac{\pi(A|S;\theta)}{\beta(A|S)}}_{\text{importance weight}} \nabla_\theta \ln \pi(A|S;\theta)\, q_\pi(S,A) \Big]}$$
where the state distribution $\rho$ is
$$\rho(s) \triangleq \sum_{s'\in\mathcal{S}} d_\beta(s') \Pr\nolimits_\pi(s|s')$$
and $\Pr_\pi(s|s')=\sum_{k=0}^\infty \gamma^k[P^k_\pi]_{s',s}=[(I-\gamma P_\pi)^{-1}]_{s',s}$ is the discounted total probability of transitioning from $s'$ to $s$ under policy $\pi$.

The off-policy policy gradient is also invariant to an additional baseline $b(s)$. In particular, we have
$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\frac{\pi(A|S;\theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S;\theta) \big( q_\pi(S,A) - b(S) \big) \Big]$$
When we take the state value as the baseline, $b(S)=v_\pi(S)$, we again obtain the advantage function
$$\delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S)$$
The corresponding stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)} \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\,\big(q_t(s_t,a_t)-v_t(s_t)\big)$$
The advantage function can be replaced by the TD error:
$$q_t(s_t,a_t)-v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \triangleq \delta_t(s_t,a_t)$$
In off-policy A2C we use the behavior policy $\beta$ to collect samples and learn a policy network $\pi(a|s;\theta)$ and a value network $v(s;w)$. The core idea of off-policy A2C is:

$$
\text{Off-policy A2C}:
\left\{
\begin{aligned}
\text{Behavior policy}: a_t & \sim \beta(\cdot|s_t) \\
\text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)}\, \delta_t\,\nabla_\theta \ln\pi(a_t|s_t;\theta_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)}\,\delta_t\, \nabla_w v(s_t;w_t)
\end{aligned}
\right.
$$

We can again write the objective functions behind these update rules:

$$
\text{Off-policy A2C}:
\left\{
\begin{aligned}
\text{Behavior policy}: A & \sim \beta \\
\text{Advantage}: \Delta(S) & = R + \gamma v(S';w) - v(S;w) \\
\text{Actor}: \max_\theta J(\theta) & = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\frac{\pi(A|S;\theta)}{\beta(A|S)}\,\Delta(S) \ln\pi(A|S;\theta)\Big] \\
\text{Critic}: \min_w J(w) & = \mathbb{E}_{S\sim\rho}\big[\big(R + \gamma v(S';w) - v(S;w)\big)^2\big] = \mathbb{E}_{S\sim\rho}[\Delta(S)^2]
\end{aligned}
\right.
$$

Pseudocode

(Figure: pseudocode of the off-policy A2C algorithm.)
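A short PyTorch sketch of the importance-weighted losses for one logged transition follows. The function signature, and in particular the `beta_prob` argument (the behavior policy's probability $\beta(a|s)$ of the logged action), are illustrative assumptions.

```python
import torch

def off_policy_a2c_losses(actor, critic, s, a, r, s_next, beta_prob, gamma=0.99):
    """Importance-weighted A2C losses for one transition collected under β."""
    v_s = critic(s)
    v_next = critic(s_next).detach()
    delta = (r + gamma * v_next - v_s).detach()  # TD error used as the advantage
    rho = (actor(s)[a] / beta_prob).detach()     # importance weight π(a|s;θ)/β(a|s)
    actor_loss = -rho * delta * torch.log(actor(s)[a])
    critic_loss = rho * (r + gamma * v_next - v_s).pow(2)
    return actor_loss, critic_loss
```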

10.4 Deterministic Actor-Critic

A deterministic policy means that for any state, a single action is given probability one and all other actions probability zero. The deterministic case is important to study because it is naturally off-policy and can effectively handle continuous action spaces. We usually write
$$a = \mu(s;\theta)$$
to denote a deterministic policy; $\mu$ directly outputs the action, since it is a mapping from $\mathcal{S}$ to $\mathcal{A}$. For the sake of simplicity, we often abbreviate $\mu(s;\theta)$ as $\mu(s)$.

Theorem 10.2 (Deterministic policy gradient theorem). The gradient of $J(\theta)$ is
$$
\begin{aligned}
\nabla_\theta J(\theta) & = \sum_{s\in\mathcal{S}}\eta(s)\, \nabla_\theta \mu(s)\, \big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} \\
& = \mathbb{E}_{S\sim\eta} \Big[ \nabla_\theta \mu(S)\,\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)} \Big]
\end{aligned}
$$
where $\eta$ is a distribution of the states.

The gradient in the deterministic case does not involve the action random variable $A$. As a result, when we use samples to approximate the true gradient, the actions are not required to be sampled from the target policy. Therefore, the deterministic policy gradient method is naturally off-policy.

Pseudocode

(Figure: pseudocode of the deterministic actor-critic algorithm.)
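Here is a minimal PyTorch sketch of one deterministic actor-critic update in the spirit of this theorem (DDPG-style, but without target networks or a replay buffer; the sizes and learning rates are illustrative assumptions). Autograd chains $\nabla_\theta\mu(s)$ with $\nabla_a q_\mu(s,a)|_{a=\mu(s)}$ automatically when we backpropagate through $q(s,\mu(s))$.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 3, 1, 0.99  # illustrative assumptions

mu = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                   nn.Linear(64, ACTION_DIM), nn.Tanh())          # a = μ(s; θ)
q = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                  nn.Linear(64, 1))                               # q(s, a; w)
mu_opt = torch.optim.Adam(mu.parameters(), lr=1e-3)
q_opt = torch.optim.Adam(q.parameters(), lr=1e-2)

def dpg_step(s, a, r, s_next):
    """One update; (s, a, r, s') may come from any behavior policy,
    e.g. μ plus exploration noise, since the method is off-policy."""
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    # Critic: the TD target uses the deterministic action μ(s') at the next state.
    with torch.no_grad():
        target = r + GAMMA * q(torch.cat([s_next, mu(s_next)]))
    q_loss = (target - q(torch.cat([s, a]))).pow(2).sum()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # Actor: maximize q(s, μ(s)); only the actor optimizer steps here.
    mu_loss = -q(torch.cat([s, mu(s)])).sum()
    mu_opt.zero_grad(); mu_loss.backward(); mu_opt.step()
```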

Reference

Zhao Shiyu's course, *Mathematical Foundation of Reinforcement Learning*.
