Transfer Reinforcement Learning Paper Notes (1): Successor Features


1. Background and problem formulation

$M \equiv(\mathcal{S}, \mathcal{A}, p, R, \gamma)$, where

$\mathcal{S}$: the state space

$\mathcal{A}$: the action space

$p$: the transition dynamics $p(\cdot\mid s_t,a_t)$

$R$: the reward function $R(s_t,a_t,s_{t+1})$

2. Successor features

Assume the reward function can be written as
$$r\left(s, a, s^{\prime}\right)=\boldsymbol{\phi}\left(s, a, s^{\prime}\right)^{\top} \mathbf{w},$$
where $\boldsymbol{\phi}(s,a,s')$ is a $d$-dimensional feature vector and $\mathbf{w}$ is the corresponding weight vector. With this form we obtain the following result (defining $\boldsymbol{\phi}_{t+1}=\boldsymbol{\phi}(s_t,a_t,s_{t+1})$):
$$\begin{aligned} Q^\pi(s, a) & =\mathrm{E}^\pi\left[r_{t+1}+\gamma r_{t+2}+\ldots \mid S_t=s, A_t=a\right] \\ & =\mathrm{E}^\pi\left[\boldsymbol{\phi}_{t+1}^{\top} \mathbf{w}+\gamma \boldsymbol{\phi}_{t+2}^{\top} \mathbf{w}+\ldots \mid S_t=s, A_t=a\right] \\ & =\mathrm{E}^\pi\left[\sum_{i=t}^{\infty} \gamma^{i-t} \boldsymbol{\phi}_{i+1} \mid S_t=s, A_t=a\right]^{\top} \mathbf{w}=\boldsymbol{\psi}^\pi(s, a)^{\top} \mathbf{w} . \end{aligned}$$
$\boldsymbol{\psi}^{\pi}(s,a)$ are the Successor Features (SFs) of $(s,a)$ under policy $\pi$.

From the definition $\boldsymbol{\psi}^{\pi}(s,a)=\mathrm{E}^{\pi}[\boldsymbol{\phi}_{t+1}+\gamma \boldsymbol{\phi}_{t+2}+\gamma^2\boldsymbol{\phi}_{t+3}+\cdots\mid S_t=s,A_t=a]$ we obtain the following Bellman equation:
$$\begin{aligned} \boldsymbol{\psi}^{\pi}(s,a)&=\mathrm{E}^{\pi}[\boldsymbol{\phi}_{t+1}+\gamma \boldsymbol{\phi}_{t+2}+\gamma^2\boldsymbol{\phi}_{t+3}+\cdots\mid S_t=s,A_t=a]\\ &=\mathrm{E}_{S_{t+1},A_{t+1}}[\boldsymbol{\phi}_{t+1}+\gamma\,\boldsymbol{\psi}^{\pi}(S_{t+1},A_{t+1})\mid S_t=s,A_t=a]\\ &=\mathrm{E}_{S_{t+1}}[\boldsymbol{\phi}_{t+1}+\gamma\,\boldsymbol{\psi}^{\pi}(S_{t+1},\pi(S_{t+1}))\mid S_t=s,A_t=a]\quad\text{(when a deterministic policy }\pi\text{ is followed).} \end{aligned}$$
Using this equation, $\boldsymbol{\psi}^{\pi}(s,a)$ can be computed iteratively (with standard dynamic-programming or TD machinery), while learning $\mathbf{w}$ is a supervised-learning problem that many machine-learning algorithms can solve.

In this way, transferring to a different task only requires solving for a different $\mathbf{w}$; see the sketch below.
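To make the two learning problems concrete, here is a minimal tabular sketch. It is only an illustration under assumed names and shapes (`n_states`, `n_actions`, `phi`, the step sizes, and the random features are hypothetical, not from the paper): $\boldsymbol{\psi}^{\pi}$ is updated with a TD rule derived from the Bellman equation above, and $\mathbf{w}$ is fitted by stochastic least squares on observed rewards.

```python
import numpy as np

# Hypothetical tabular setup: n_states, n_actions, d-dimensional features
# phi(s, a, s'), a fixed policy pi to evaluate, and step sizes.
n_states, n_actions, d = 10, 4, 5
rng = np.random.default_rng(0)
phi = rng.normal(size=(n_states, n_actions, n_states, d))  # feature map phi(s, a, s')
psi = np.zeros((n_states, n_actions, d))                   # successor features psi^pi(s, a)
w = np.zeros(d)                                            # task weights, r = phi^T w
gamma, alpha_psi, alpha_w = 0.9, 0.1, 0.1

def sf_td_update(s, a, s_next, pi):
    """One TD step on psi^pi using
    psi(s, a) = E[phi_{t+1} + gamma * psi(S_{t+1}, pi(S_{t+1}))]."""
    target = phi[s, a, s_next] + gamma * psi[s_next, pi(s_next)]
    psi[s, a] += alpha_psi * (target - psi[s, a])

def w_update(s, a, s_next, r):
    """One stochastic least-squares step for the regression r ~ phi(s, a, s')^T w."""
    global w
    err = r - phi[s, a, s_next] @ w
    w += alpha_w * err * phi[s, a, s_next]

def q_value(s, a):
    """Action value recovered as Q^pi(s, a) = psi^pi(s, a)^T w."""
    return float(psi[s, a] @ w)
```

After enough transitions generated under $\pi$, `q_value` approximates $Q^{\pi}$; for a new task only `w_update` needs to run again, while `psi` can be reused.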

3. Generalized policy improvement

In the paper, the authors also prove a generalization error bound for transfer reinforcement learning.

Theorem 1. (Generalized Policy Improvement) Let $\pi_1, \pi_2, \ldots, \pi_n$ be $n$ decision policies and let $\tilde{Q}^{\pi_1}, \tilde{Q}^{\pi_2}, \ldots, \tilde{Q}^{\pi_n}$ be approximations of their respective action-value functions such that
$$\left|Q^{\pi_i}(s, a)-\tilde{Q}^{\pi_i}(s, a)\right| \leq \epsilon \text{ for all } s \in \mathcal{S}, a \in \mathcal{A}, \text{ and } i \in\{1,2, \ldots, n\}.$$

Define
$$\pi(s) \in \underset{a}{\operatorname{argmax}} \max _i \tilde{Q}^{\pi_i}(s, a) .$$

Then,
$$Q^\pi(s, a) \geq \max _i Q^{\pi_i}(s, a)-\frac{2}{1-\gamma} \epsilon$$
for any $s \in \mathcal{S}$ and $a \in \mathcal{A}$, where $Q^\pi$ is the action-value function of $\pi$.
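Computationally, the GPI policy only evaluates each cached value function at the current state and acts greedily with respect to their pointwise maximum. A minimal sketch, under an array layout of my own choosing:

```python
import numpy as np

def gpi_action(q_tilde, s):
    """Generalized policy improvement: pi(s) in argmax_a max_i Q~^{pi_i}(s, a).

    q_tilde: array of shape (n_policies, n_states, n_actions) holding the
    approximate action values of the previously learned policies.
    """
    return int(np.argmax(q_tilde[:, s, :].max(axis=0)))
```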

Proof: To simplify notation, define
$$Q_{\max}(s,a)=\max_{i}Q^{\pi_i}(s,a)\quad\text{(the maximum of the true action-value functions of the policies }\pi_{i}\text{)},$$
$$\tilde{Q}_{\max}(s,a)=\max_{i}\tilde{Q}^{\pi_{i}}(s,a)\quad\text{(the maximum of their approximations)}.$$
With this notation, we have the following inequality:
$$\left|Q_{\max }(s, a)-\tilde{Q}_{\max }(s, a)\right|=\left|\max _i Q^{\pi_i}(s, a)-\max _i \tilde{Q}^{\pi_i}(s, a)\right| \leq \max _i\left|Q^{\pi_i}(s, a)-\tilde{Q}^{\pi_i}(s, a)\right| \leq \epsilon .$$
Hence
$$Q_{\max }(s, a)-\epsilon \leq\tilde{Q}_{\max }(s, a).$$
We will use the Bellman operator $T^{\pi}$, defined by
$$T^{\pi}f(s,a)=r(s,a)+\gamma\,\mathrm{E}_{s'\sim p(s'\mid s,a)}[V(s')],\qquad V(s')=\mathrm{E}_{a\sim \pi(a\mid s')}[f(s',a)],\qquad r(s,a)=\mathrm{E}_{s'\sim p(s'\mid s,a)}[r(s,a,s')].$$
Since $\pi$ is deterministic (it picks, at each state, the action with the largest estimated value over all policies), $V(s')=f(s',\pi(s'))$.

For any $(s,a)\in \mathcal{S} \times \mathcal{A}$ and any policy $\pi_{i}$, the following holds:
$$\begin{aligned} T^\pi \tilde{Q}_{\max }(s, a) & =r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s,a\right) \tilde{Q}_{\max }\left(s^{\prime}, \pi\left(s^{\prime}\right)\right) \\ & =r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max _b \tilde{Q}_{\max }\left(s^{\prime}, b\right) \\ & \geq r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max _b Q_{\max }\left(s^{\prime}, b\right)-\gamma \epsilon \\ & \geq r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q_{\max }\left(s^{\prime}, \pi_i\left(s^{\prime}\right)\right)-\gamma \epsilon \\ & \geq r(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q^{\pi_i}\left(s^{\prime}, \pi_i\left(s^{\prime}\right)\right)-\gamma \epsilon \\ & =T^{\pi_i} Q^{\pi_i}(s, a)-\gamma \epsilon \\ & =Q^{\pi_i}(s, a)-\gamma \epsilon . \end{aligned}$$
Since $T^\pi \tilde{Q}_{\max }(s, a)\geq Q^{\pi_i}(s, a)-\gamma \epsilon$ holds for every policy $\pi_i$,
$$\begin{aligned} T^\pi \tilde{Q}_{\max }(s, a)&\geq \max_{i}Q^{\pi_{i}}(s,a)-\gamma \epsilon\\ &\geq \tilde{Q}_{\max }(s, a)-\epsilon-\gamma\epsilon=\tilde{Q}_{\max }(s, a)-(1+\gamma)\epsilon. \end{aligned}$$
To reach the final conclusion, we also need the following fact:
$$\begin{aligned} T^{\pi}(f(s,a)+c)&=r(s,a)+\gamma\,\mathrm{E}_{s'\sim p(s'\mid s,a)}[f(s',\pi(s'))+c]\\ &=r(s,a)+\gamma\,\mathrm{E}_{s'\sim p(s'\mid s,a)}[f(s',\pi(s'))]+\gamma\cdot c\\ &=T^{\pi}(f(s,a))+\gamma\cdot c. \end{aligned}$$
Therefore,
$$\begin{aligned} T^{\pi}\tilde{Q}_{\max }(s, a)&\geq \tilde{Q}_{\max }(s, a)-(1+\gamma)\epsilon\\ T^{\pi}\left(T^{\pi}\tilde{Q}_{\max }(s, a)\right)&\geq T^{\pi}\tilde{Q}_{\max }(s, a)-\gamma(1+\gamma)\epsilon\\ &\;\;\vdots\\ (T^{\pi})^{k}\tilde{Q}_{\max }(s, a)&\geq (T^{\pi})^{k-1}\tilde{Q}_{\max }(s, a)-\gamma^{k-1}(1+\gamma)\epsilon. \end{aligned}$$
Telescoping these inequalities and letting $k$ tend to infinity (recall that $(T^\pi)^k f \to Q^\pi$ for any bounded $f$) yields
$$\begin{aligned} Q^\pi(s, a) & =\lim _{k \rightarrow \infty}\left(T^\pi\right)^k \tilde{Q}_{\max }(s, a) \\ & \geq \tilde{Q}_{\max }(s, a)-\frac{1+\gamma}{1-\gamma} \epsilon \\ & \geq Q_{\max }(s, a)-\epsilon-\frac{1+\gamma}{1-\gamma} \epsilon\\ & = \max _i Q^{\pi_i}(s, a)-\frac{2}{1-\gamma} \epsilon. \end{aligned}$$
Q.E.D.

To establish the final error bound, we also need the following lemma.

Lemma 1. Let $\delta_{i j}=\max _{s, a}\left|r_i(s, a)-r_j(s, a)\right|$. Then,
$$Q_i^{\pi_i^*}(s, a)-Q_i^{\pi_j^*}(s, a) \leq \frac{2 \delta_{i j}}{1-\gamma} .$$

Proof: To simplify notation, let $Q_i^j(s, a) \equiv Q_i^{\pi_j^*}(s, a)$. Then
$$\begin{aligned} Q_i^i(s, a)-Q_i^j(s, a) & =Q_i^i(s, a)-Q_j^j(s, a)+Q_j^j(s, a)-Q_i^j(s, a) \\ & \leq\left|Q_i^i(s, a)-Q_j^j(s, a)\right|+\left|Q_j^j(s, a)-Q_i^j(s, a)\right| . \end{aligned}$$
Define $\Delta_{i j}=\max _{s, a}\left|Q_i^i(s, a)-Q_j^j(s, a)\right|$. Then
$$\begin{aligned} \left|Q_i^i(s, a)-Q_j^j(s, a)\right| & =\left|r_i(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max _b Q_i^i\left(s^{\prime}, b\right)-r_j(s, a)-\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max _b Q_j^j\left(s^{\prime}, b\right)\right| \\ & =\left|r_i(s, a)-r_j(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left(\max _b Q_i^i\left(s^{\prime}, b\right)-\max _b Q_j^j\left(s^{\prime}, b\right)\right)\right| \\ & \leq\left|r_i(s, a)-r_j(s, a)\right|+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left|\max _b Q_i^i\left(s^{\prime}, b\right)-\max _b Q_j^j\left(s^{\prime}, b\right)\right| \\ & \leq\left|r_i(s, a)-r_j(s, a)\right|+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \max _b\left|Q_i^i\left(s^{\prime}, b\right)-Q_j^j\left(s^{\prime}, b\right)\right| \\ & \leq \delta_{i j}+\gamma \Delta_{i j} . \end{aligned}$$

From the above, we obtain
$$\Delta_{i j} \leq \frac{1}{1-\gamma} \delta_{i j} .$$

Similarly, define $\Delta_{i j}^{\prime}=\max _{s, a}\left|Q_j^j(s, a)-Q_i^j(s, a)\right|$. Then
$$\begin{aligned} \left|Q_j^j(s, a)-Q_i^j(s, a)\right| & =\left|r_j(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q_j^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)-r_i(s, a)-\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) Q_i^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)\right| \\ & =\left|r_j(s, a)-r_i(s, a)+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left(Q_j^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)-Q_i^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)\right)\right| \\ & \leq\left|r_i(s, a)-r_j(s, a)\right|+\gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left|Q_j^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)-Q_i^j\left(s^{\prime}, \pi_j^*\left(s^{\prime}\right)\right)\right| \\ & \leq \delta_{i j}+\gamma \Delta_{i j}^{\prime} . \end{aligned}$$

In the same way we obtain
$$\Delta_{i j}^{\prime} \leq \frac{1}{1-\gamma} \delta_{i j} .$$

Combining these two bounds with the first inequality gives $Q_i^i(s, a)-Q_i^j(s, a) \leq \Delta_{i j}+\Delta_{i j}^{\prime} \leq \frac{2 \delta_{i j}}{1-\gamma}$. Q.E.D.

Theorem 2. Let $M_i \in \mathcal{M}^\phi$ and let $Q_i^{\pi_j^*}$ be the value function of an optimal policy of $M_j \in \mathcal{M}^\phi$ when executed in $M_i$. Given the set $\left\{\tilde{Q}_i^{\pi_1^*}, \tilde{Q}_i^{\pi_2^*}, \ldots, \tilde{Q}_i^{\pi_n^*}\right\}$ such that
$$\left|Q_i^{\pi_j^*}(s, a)-\tilde{Q}_i^{\pi_j^*}(s, a)\right| \leq \epsilon \text{ for all } s \in \mathcal{S}, a \in \mathcal{A}, \text{ and } j \in\{1,2, \ldots, n\},$$
let
$$\pi(s) \in \underset{a}{\operatorname{argmax}} \max _j \tilde{Q}_i^{\pi_j^*}(s, a) .$$

Finally, let $\phi_{\max }=\max _{s, a}\|\boldsymbol{\phi}(s, a)\|$, where $\|\cdot\|$ is the norm induced by the inner product adopted. Then,
$$Q_i^*(s, a)-Q_i^\pi(s, a) \leq \frac{2}{1-\gamma}\left(\phi_{\max } \min _j\left\|\mathbf{w}_i-\mathbf{w}_j\right\|+\epsilon\right) .$$

Proof: For any $j$,
$$\begin{aligned} Q_i^*(s, a)-Q_i^\pi(s, a) & \leq Q_i^*(s, a)-Q_i^{\pi_j^*}(s, a)+\frac{2}{1-\gamma} \epsilon \\ & \leq \frac{2}{1-\gamma} \max _{s, a}\left|r_i(s, a)-r_j(s, a)\right|+\frac{2}{1-\gamma} \epsilon \\ & =\frac{2}{1-\gamma} \max _{s, a}\left|\boldsymbol{\phi}(s, a)^{\top} \mathbf{w}_i-\boldsymbol{\phi}(s, a)^{\top} \mathbf{w}_j\right|+\frac{2}{1-\gamma} \epsilon \\ & =\frac{2}{1-\gamma} \max _{s, a}\left|\boldsymbol{\phi}(s, a)^{\top}\left(\mathbf{w}_i-\mathbf{w}_j\right)\right|+\frac{2}{1-\gamma} \epsilon \\ & \leq \frac{2}{1-\gamma} \max _{s, a}\|\boldsymbol{\phi}(s, a)\|\left\|\mathbf{w}_i-\mathbf{w}_j\right\|+\frac{2}{1-\gamma} \epsilon \\ & =\frac{2 \phi_{\max }}{1-\gamma}\left\|\mathbf{w}_i-\mathbf{w}_j\right\|+\frac{2}{1-\gamma} \epsilon, \end{aligned}$$
where the first inequality follows from Theorem 1, the second from Lemma 1, and the second-to-last step from the Cauchy–Schwarz inequality. Since $j$ is arbitrary, taking the minimum over $j$ gives the stated bound. Q.E.D.
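Theorem 2 is what makes SFs practical for transfer: on a new task with weights $\mathbf{w}_i$, the cached SFs of previously learned policies immediately give the value estimates that GPI needs, with no further policy evaluation. A minimal sketch, reusing the hypothetical array shapes from the earlier snippets:

```python
import numpy as np

def sf_gpi_action(psi_list, w_new, s):
    """Act on a new task with weights w_new by GPI over cached successor features.

    psi_list: list of arrays, one per old policy pi_j, each of shape
    (n_states, n_actions, d), holding psi^{pi_j}(s, a).
    The value estimates are Q~_i^{pi_j}(s, a) = psi^{pi_j}(s, a)^T w_new.
    """
    q = np.stack([psi[s] @ w_new for psi in psi_list])  # (n_policies, n_actions)
    return int(np.argmax(q.max(axis=0)))
```

The quality of the resulting policy degrades gracefully with $\min_j\|\mathbf{w}_i-\mathbf{w}_j\|$ and with the SF approximation error, exactly as the bound above states.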
