Original paper demo: metamotivo; GitHub: metamotivo

Idea

This kind of reward-free reinforcement learning essentially maps the reward $r(s)$ through $B(s)$ into a low-dimensional space $\mathbb{R}^d$, obtaining a low-dimensional representation $z\in\mathbb{R}^d$, and then maps from the low-dimensional space into the policy space $\Pi$ via $\pi$. Training, however, still lacks an estimate of the value function $Q^{\pi}(s,a)$, so we additionally introduce $F(s,a|\pi)\in\mathbb{R}^d$ to map the policy back into the low-dimensional space; the value estimate is then obtained via an inner product, $Q^{\pi}(s,a) = \langle F(s,a|\pi), B(s)\rangle = F(s,a|\pi)^TB(s)$. The overall flow is illustrated below:

Here the low-dimensional vector $z$ is constrained to the sphere of radius $\sqrt{d}$, which is convenient both for representation and as an input to neural networks.
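As a structural illustration, here is a minimal sketch, assuming small MLPs and illustrative dimensions (this is not the metamotivo implementation), of the two maps $F$ and $B$, the inner-product structure, and the projection of $z$ onto the sphere of radius $\sqrt{d}$:

```python
import torch
import torch.nn as nn

d = 64                 # latent dimension (illustrative)
s_dim, a_dim = 17, 6   # state / action dimensions (illustrative)

def project_to_sphere(z: torch.Tensor) -> torch.Tensor:
    """Rescale z onto the sphere of radius sqrt(d)."""
    return (d ** 0.5) * z / z.norm(dim=-1, keepdim=True)

# Forward map F(s, a, z) -> R^d and backward map B(s) -> R^d
F_net = nn.Sequential(nn.Linear(s_dim + a_dim + d, 256), nn.ReLU(), nn.Linear(256, d))
B_net = nn.Sequential(nn.Linear(s_dim, 256), nn.ReLU(), nn.Linear(256, d))

def inner_product_value(s, a, z, s_prime):
    """F(s,a,z)^T B(s'): the inner-product structure that underlies both the
    successor-measure factorization and the value estimate introduced below."""
    F = F_net(torch.cat([s, a, z], dim=-1))   # (batch, d)
    B = B_net(s_prime)                        # (batch, d)
    return (F * B).sum(dim=-1)

z = project_to_sphere(torch.randn(d))         # a latent z on the sqrt(d)-sphere
```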
FB Theory Derivation

Preliminaries

First, some notation. Let $S, A$ be the state and action spaces, and let the positive integer $d\in \mathbb{Z}^+$ be the dimension of the low-dimensional space. $\text{Pr}_{t}(s'|s,a,\pi)$ denotes the probability of reaching $s'$ at step $t$ when starting from state $s$ with action $a$ and then following policy $\pi$; similarly, $\mathbb{E}[r_t|s,a,\pi]$ denotes the expectation of the reward $r(s_t)$ at step $t$ when starting from state $s$ with action $a$ and then following policy $\pi$.

The idea above can be described by a complete theory. Define:

Forward function: $F(s,a,z): S\times A\times \mathbb{R}^d\to \mathbb{R}^d$. In the idea above, $F$ maps the policy space $\Pi$ back into $\mathbb{R}^d$; but since a policy $\pi$ is hard to represent directly, and there is a mapping $\pi$ between $\mathbb{R}^d$ and $\Pi$, elements of $\Pi$ can be described by their latent $z$.
Backward function: $B(s): S\to \mathbb{R}^d$

Policy function: $\pi(s,z): S\times \mathbb{R}^d\to A$
How should $F,B,\pi$ be optimized, and what relations should they satisfy? Should we look to the Bellman equation? The Bellman equations associated with value functions all involve the reward and therefore cannot be used in a reward-free setting, but the successor measure is reward-independent.

Definition 1 (Successor measure)

For all $X\subset S, s\in S, a\in A, \pi\in \Pi$, the successor measure $M^{\pi}: 2^S\times S\times A\to \mathbb{R}$ is defined as
$$M^{\pi}(X|s,a):=\sum_{t=0}^{\infty}\gamma^t\text{Pr}_{t+1}(s'\in X|s,a,\pi)$$
where $\gamma\in[0,1)$ is the discount factor.

The successor measure represents, under policy $\pi$ and starting from the state-action pair $(s,a)$, the cumulative discounted probability of reaching states in $X$.

P.S. The successor measure also appears in the PPO algorithm, namely $\rho_{\tilde{\pi}}(S)$ in Blog - PPO, Corollary 1 (parameterized form of the policy return).

It is easy to see that the Bellman equation of the successor measure is
$$M^{\pi}(X|s,a) = P(X|s,a) + \gamma\,\mathbb{E}_{\substack{s'\sim p(\cdot|s,a)\\a'\sim\pi(\cdot|s')}}[M^{\pi}(X|s',a')] \tag{1}$$
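For intuition, here is a minimal numerical sketch of Definition 1 and equation (1) on a tiny finite MDP (all sizes and probabilities are illustrative): with rows indexed by $(s,a)$ and columns by $s'$, equation (1) becomes a linear system that can be solved directly.

```python
import numpy as np

# Tiny finite MDP: |S| = 3 states, |A| = 2 actions, random dynamics/policy (illustrative).
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] = Pr_1(s' | s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi[s, a] = pi(a | s)

P_flat = P.reshape(nS * nA, nS)                 # rows are (s, a) pairs
# Pi_mat turns a distribution over next states into one over next (state, action) pairs.
Pi_mat = np.zeros((nS, nS * nA))
for s in range(nS):
    Pi_mat[s, s * nA:(s + 1) * nA] = pi[s]

# Equation (1): M = P_flat + gamma * (P_flat @ Pi_mat) @ M
# => (I - gamma * P_flat @ Pi_mat) M = P_flat
M = np.linalg.solve(np.eye(nS * nA) - gamma * P_flat @ Pi_mat, P_flat)
print(M.shape)  # (nS*nA, nS): discounted visitation of each s' from each (s, a)
```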
Proposition 1 (Successor measure and action-value function)

Let the action-value function be $Q^{\pi}(s,a) := \sum_{t=0}^{\infty}\gamma^t\mathbb{E}[r_{t+1}|s,a,\pi]$. Then
$$Q^{\pi}(s,a) = \int_{s'\in S}M^{\pi}(s'|s,a)r(s')\,\mathrm{d}s'$$
Proof:
$$\begin{aligned}
Q^{\pi}(s,a) =&\ \sum_{t=0}^{\infty}\gamma^t\mathbb{E}[r_{t+1}|s,a,\pi] = \sum_{t=0}^{\infty}\gamma^t\mathbb{E}_{s_{t+1}}[r(s_{t+1})|s,a,\pi] \\
=&\ \sum_{t=0}^{\infty}\gamma^t\int_{s'\in S}\text{Pr}_{t+1}(s'|s,a,\pi)r(s')\,\mathrm{d}s'\\
=&\ \int_{s'\in S}r(s')\sum_{t=0}^{\infty}\gamma^t\text{Pr}_{t+1}(s'|s,a,\pi)\,\mathrm{d}s'\\
=&\ \int_{s'\in S}M^{\pi}(s'|s,a)r(s')\,\mathrm{d}s'
\end{aligned}$$
Now suppose we factorize $M^{\pi_z}$ as $F(s,a,z)^T B(s')$. Observe that the factor $B(s')r(s')$ is exactly what maps $r$ into the low-dimensional space of $z$, while $F(\cdot,\cdot,z)$ maps the policy $\pi(\cdot,z)$ corresponding to $z$ back into the low-dimensional space. This gives:

Proposition 2 (FB factorization of $M^{\pi}$)

For all $z\in\mathbb{R}^d$, if there exist $F(s,a,z): S\times A\times \mathbb{R}^d\to \mathbb{R}^d$, $B(s): S\to \mathbb{R}^d$, $\pi_z(s)=\pi(s,z): S\times \mathbb{R}^d\to A$, and a distribution $\rho: S\to \mathbb{R}$ such that
$$M^{\pi_z}(X|s,a) = \int_{s'\in X}F(s,a,z)^TB(s')\rho(s')\,\mathrm{d}s' = F(s,a,z)^T\mathbb{E}_{s'\sim\rho,s'\in X}[B(s')]$$
then for $z=\mathbb{E}_{s\sim\rho}[B(s)r(s)]$ we have $Q^{\pi_z}(s,a)=F(s,a,z)^Tz$.

Proof:
$$\begin{aligned}
Q^{\pi_z}(s,a) =&\ \int_{s'\in S}M^{\pi_z}(s'|s,a)r(s')\,\mathrm{d}s'\\
=&\ \int_{s'\in S}F(s,a,z)^TB(s')\rho(s')r(s')\,\mathrm{d}s'\\
=&\ F(s,a,z)^T\mathbb{E}_{s\sim \rho}[B(s)r(s)] = F(s,a,z)^Tz
\end{aligned}$$
In this way the reward function $r$ is fully mapped into $\mathbb{R}^d$, and updating the model itself no longer depends on $r$.
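A minimal sketch of how Proposition 2 is used, assuming reward-labelled samples $s\sim\rho$ and the illustrative F_net / B_net from the earlier sketch: estimate $z=\mathbb{E}_{s\sim\rho}[B(s)r(s)]$ by a sample mean, then read off $Q^{\pi_z}(s,a)=F(s,a,z)^Tz$.

```python
import torch

@torch.no_grad()
def infer_z(B_net, states, rewards):
    """Monte-Carlo estimate of z = E_{s~rho}[B(s) r(s)] from a batch of labelled states."""
    B = B_net(states)                               # (n, d)
    return (B * rewards.unsqueeze(-1)).mean(dim=0)  # (d,)

@torch.no_grad()
def q_from_z(F_net, s, a, z):
    """Q^{pi_z}(s, a) = F(s, a, z)^T z."""
    z_batch = z.expand(s.shape[0], -1)
    F = F_net(torch.cat([s, a, z_batch], dim=-1))   # (batch, d)
    return F @ z                                    # (batch,)
```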
Theorem 1 (FB constraints and unsupervised RL)

Suppose there exist $F(s,a,z): S\times A\times \mathbb{R}^d\to \mathbb{R}^d$, $B(s): S\to \mathbb{R}^d$, $\pi_z(s): S\times \mathbb{R}^d\to A$ such that for all $z\in\mathbb{R}^d$, $s,s'\in S$, $a\in A$, $X\subset S$ we have
$$M^{\pi_z}(X|s,a)=F(s,a,z)^T\mathbb{E}_{s'\sim\rho, s'\in X}[B(s')] \tag{2}$$

$$\pi_z(s)=\argmax_{a\in A}F(s,a,z)^Tz \tag{3}$$
Equivalently:
$$\begin{cases}
M^{\pi_z}(s'|s,a)=F(s,a,z)^TB(s')\rho(s')\\
\pi_z(s)=\argmax_{a\in A}F(s,a,z)^Tz
\end{cases}$$
Then for any reward function $r: S\to \mathbb{R}$, setting $z=\mathbb{E}_{s\sim\rho}[B(s)r(s)]$ makes $\pi_z$ an optimal policy for the MDP $\{S,A,r,P,\mu\}$.

Proof: by Proposition 2, $\pi_z(s)=\argmax_{a\in A}Q^{\pi_z}(s,a)$, i.e., $\pi_z$ is greedy with respect to its own action-value function and therefore satisfies the Bellman optimality condition, so $\pi_z(s)$ is an optimal policy.

We have thus obtained a target identity, equation $(2)$, and a policy-optimization objective, equation $(3)$. Equation $(3)$ can be handled with standard RL algorithms (PPO, TD3, SAC, etc.), while the constraint $(2)$ requires a loss function derived from the Bellman equation $(1)$ (analogous to the Q-value loss in DQN).

Definition 2 (FB loss)

Parameterize $F,B$ with neural networks with parameters $\theta, \omega$; the FB loss is then defined as
$$\mathcal{L}_{FB}(\theta,\omega):= \frac{1}{2}\,\mathbb{E}_{\substack{s'\sim p(\cdot|s,a),\,a'\sim\pi(\cdot|s')\\s''\sim\rho,\,s''\in X}}\left[\left(F_{\theta}(s,a,z)^TB_{\omega}(s'')-\gamma\bar{F}_{\theta}(s',a',z)^T\bar{B}_{\omega}(s'')\right)^2\right] - F_{\theta}(s,a,z)^T\mathbb{E}_{\substack{s'\sim\rho\\s'\in X}}[B_{\omega}(s')]$$
Explanation: substitute the representation $(2)$ of $M^{\pi}$ into its Bellman equation $(1)$. We want to minimize the Bellman residual, i.e., treat the left-hand side as the current network, subtract the estimate given by the right-hand side, and take the squared ($\ell_2$) error as the loss for the current parameters:
$$\begin{aligned}
\mathcal{L}(\theta,\omega) =&\ \left[F_{\theta}(s,a,z)^T\mathbb{E}_{\substack{s'\sim\rho\\s'\in X}}[B_{\omega}(s')]-P(X|s,a)-\gamma\mathbb{E}_{\substack{s'\sim p(\cdot|s,a)\\a'\sim\pi(\cdot|s')}}\left[\bar{F}_{\theta}(s',a',z)^T\mathbb{E}_{\substack{s''\sim\rho\\s''\in X}}[\bar{B}_{\omega}(s'')]\right]\right]^2\\
=&\ \left[\mathbb{E}_{\substack{s'\sim p(\cdot|s,a),\,a'\sim\pi(\cdot|s')\\s''\sim\rho,\,s''\in X}}[F_{\theta}(s,a,z)^TB_{\omega}(s'')-\gamma\bar{F}_{\theta}(s',a',z)^T\bar{B}_{\omega}(s'')]\right]^2\\
&\quad -2P(X|s,a)\,F_{\theta}(s,a,z)^T\mathbb{E}_{\substack{s'\sim\rho\\s'\in X}}[B_{\omega}(s')] + 2\gamma P(X|s,a)\,\mathbb{E}_{\substack{s'\sim p(\cdot|s,a)\\a'\sim\pi(\cdot|s')}}\left[\bar{F}_{\theta}(s',a',z)^T\mathbb{E}_{\substack{s''\sim\rho\\s''\in X}}[\bar{B}_{\omega}(s'')]\right] + P(X|s,a)^2
\end{aligned}$$
When minimizing over $\theta,\omega$, the third term can be dropped (it involves only the target networks $\bar{F}_{\theta},\bar{B}_{\omega}$, which are held fixed), as can the constant $P(X|s,a)^2$. Taking $X$ to range over sampled trajectory segments, the remaining cross term $-2P(X|s,a)F_{\theta}(s,a,z)^T\mathbb{E}_{s'\sim\rho,s'\in X}[B_{\omega}(s')]$ is, for a transition $(s,a,s')$, estimated by evaluating $B_{\omega}$ at the observed next state $s'\sim P(\cdot|s,a)$. Therefore
$$\min_{\theta,\omega}\mathcal{L}_{FB}(\theta,\omega)\iff \min_{\theta,\omega}\left\{\frac{1}{2}\,\mathbb{E}_{\substack{s'\sim p(\cdot|s,a),\,a'\sim\pi(\cdot|s')\\s''\sim\rho,\,s''\in X}}\left[\left(F_{\theta}(s,a,z)^TB_{\omega}(s'')-\gamma\bar{F}_{\theta}(s',a',z)^T\bar{B}_{\omega}(s'')\right)^2\right] - F_{\theta}(s,a,z)^T\,\mathbb{E}_{s'\sim P(\cdot|s,a)}[B_{\omega}(s')]\right\}$$
Suppose the batch size is $n$. By some sampling scheme we obtain trajectory fragments associated with each $z_i$, giving tuples $(s_i,a_i,s'_i,z_i),\cdots$; we then sample $a'_i\sim \pi(\cdot|s'_i,z_i)$ and compute the loss
$$\mathcal{L}_{FB}(\theta,\omega) = \frac{1}{n(n-1)}\sum_{i\neq j}\left[F_{\theta}(s_i,a_i,z_i)^TB_{\omega}(s_j)-\gamma\bar{F}_{\theta}(s'_i,a'_i,z_i)^T\bar{B}_{\omega}(s_j)\right]^2 - \frac{1}{n}\sum_{i}F_{\theta}(s_i,a_i,z_i)^TB_{\omega}(s'_i)$$
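A PyTorch sketch of this batched loss, reusing the illustrative F_net / B_net interfaces from above; treating $\bar{F},\bar{B}$ as stop-gradient target copies and the discount value are assumptions, not details taken from the official code.

```python
import torch

def fb_loss(F_net, B_net, F_tgt, B_tgt, s, a, s_next, a_next, z, gamma=0.98):
    """Empirical FB loss over a batch of n tuples (s_i, a_i, s'_i, z_i) with a'_i ~ pi(.|s'_i, z_i)."""
    n = s.shape[0]
    F      = F_net(torch.cat([s, a, z], dim=-1))               # F_theta(s_i, a_i, z_i), (n, d)
    B      = B_net(s)                                          # B_omega(s_j), (n, d)
    B_next = B_net(s_next)                                     # B_omega(s'_i), (n, d)
    with torch.no_grad():                                      # target networks carry no gradient
        F_bar = F_tgt(torch.cat([s_next, a_next, z], dim=-1))  # (n, d)
        B_bar = B_tgt(s)                                       # (n, d)

    # Pairwise terms: entry (i, j) is F_i^T B(s_j) - gamma * Fbar_i^T Bbar(s_j).
    diff_sq  = (F @ B.T - gamma * F_bar @ B_bar.T) ** 2
    off_diag = diff_sq.sum() - diff_sq.diagonal().sum()        # keep only i != j
    loss_sq  = off_diag / (n * (n - 1))

    # Diagonal term: -(1/n) * sum_i F_i^T B(s'_i)
    loss_lin = -(F * B_next).sum(dim=-1).mean()
    return loss_sq + loss_lin
```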
Derivation of the Imitation Regularization Term

Although the derivation above already optimizes $F_{\theta}, B_{\omega}, \pi_{\phi}$, it cannot guarantee that policies $\pi_{\phi}$ obtained from randomly sampled $z$ explore enough of the state space. We therefore introduce a sufficiently large expert dataset to guide $\pi_{\phi}$ towards adequately covering the state space.

The expert dataset consists of states only and is written $M=\{(s_1,\cdots,s_{l(\tau)})\}=\{\tau\}$.

Remark: why can the expert dataset $M=\{\tau\}$ reach more states? Take any state $s$: in a problem such as robot balancing, different policies differ in quality, and the expert selects the right ones; under those policies $s$ is visited more often, which also indicates that they are more stable. This induces a joint distribution of $s$ and policy $\pi$, written $p_{M}(s,\pi)$. Since a policy $\pi$ cannot be fed into a neural network directly, $\pi$ is represented by a low-dimensional vector $z\in\mathbb{R}^d$, and the joint distribution becomes $p_{M}(s,z)$.

The policy network $\pi(\cdot|s,z)$ can take $z$ and $s$ together as its input.

Analogously, $\pi_{\phi}(\cdot|s,z)$ also induces a joint distribution $p_{\pi_{z}}(s,z)$: for each $s$ there is a distribution over the corresponding policies in the low-dimensional space. If we want $\pi_{\phi}$ to behave like the expert policies and explore more states, we should make $p_{\pi_z}(s,z)$ approximate $p_{M}(s,z)$.

This could be measured with the KL divergence, but since $p_{M}(s,z)$ is hard to estimate accurately, we follow the GAN approach (which is essentially the Jensen-Shannon (JS) divergence): build a discriminator $D_{\psi}:S\times \mathbb{R}^d\to [0,1]$ that judges whether a pair $(s,z)$ comes from $p_{M}(s,z)$ rather than from $p_{\pi_{\phi}}(s,z)$, while the policy $\pi_{\phi}$ tries to make $p_{\pi_{\phi}}(s,z)$ approximate $p_M(s,z)$ so as to fool $D_{\psi}$. The discriminator is trained with the following GAN loss:
$$\begin{aligned}
\mathcal{L}_{discriminator}(\psi) =&\ -\mathbb{E}_{(s,z)\sim p_{M}}[\log(D_{\psi}(s,z))]-\mathbb{E}_{(s,z)\sim p_{\pi_\phi}}[\log(1-D_{\psi}(s,z))]\\
=&\ -\mathbb{E}_{s\sim M}\left[\log\big(D_{\psi}(s,\mathbb{E}_{s'\sim\tau(s)}[B(s')])\big)\right] -\mathbb{E}_{z\sim\upsilon,\,s\sim \rho^{\pi_z}}[\log(1-D_{\psi}(s,z))]
\end{aligned}$$
This loss has the theoretical optimum $D^*(s,z)=\frac{p_{M}(s,z)}{p_{M}(s,z)+p_{\pi_z}(s,z)}$.

To prove it, differentiate $p_{M}\log D+p_{\pi_z}\log(1-D)$ with respect to $D$ and set the derivative to $0$.
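A minimal sketch of this discriminator update in its binary cross-entropy form, assuming $D_\psi$ is a network ending in a sigmoid and that the expert latents are produced by encoding expert trajectories with $B$ (as in the second line of the loss); the small epsilon is only for numerical stability.

```python
import torch

def discriminator_loss(D_net, s_expert, z_expert, s_policy, z_policy, eps=1e-8):
    """-E_pM[log D(s,z)] - E_{p_pi}[log(1 - D(s,z))]; expert pairs labelled 1, policy pairs 0."""
    d_exp = D_net(torch.cat([s_expert, z_expert], dim=-1)).squeeze(-1)
    d_pol = D_net(torch.cat([s_policy, z_policy], dim=-1)).squeeze(-1)
    return -(torch.log(d_exp + eps).mean() + torch.log(1.0 - d_pol + eps).mean())
```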
The goal of $\pi_z=\pi_{\phi}(\cdot|s,z)$ is to confuse $D$'s judgment, i.e., to maximize the following reward:
$$\max\ r(s,z) = \log\frac{p_{M}(s,z)}{p_{\pi_z}(s,z)} = \log\frac{D^*}{1-D^*} \approx \log\frac{D_{\psi}}{1-D_{\psi}}$$
We estimate the discounted return of this reward in a TD fashion. Let $Q_{\eta}(s,a,z):S\times A\times\mathbb{R}^d\to \mathbb{R}$, which we may call the imitation return; the corresponding critic loss is
$$\mathcal{L}_{critic}(\eta) = \mathbb{E}_{\substack{(s,a,s')\sim \mathcal{D}_{online}\\z\sim \upsilon,\,a'\sim\pi_z(\cdot|s')}}\left[\left(Q_{\eta}(s,a,z)-\log\frac{D_{\psi}(s',z)}{1-D_{\psi}(s',z)}-\gamma \bar{Q}_{\eta}(s',a',z)\right)^2\right]$$
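A sketch of this TD regression in PyTorch, under the same assumed interfaces (concatenated inputs, a stop-gradient target copy $\bar{Q}_\eta$, an assumed discount value):

```python
import torch

def critic_loss(Q_net, Q_tgt, D_net, s, a, s_next, a_next, z, gamma=0.98):
    """Regress Q_eta(s,a,z) onto log(D/(1-D)) at s' plus the discounted target Q at (s', a', z)."""
    q = Q_net(torch.cat([s, a, z], dim=-1)).squeeze(-1)
    with torch.no_grad():
        d = D_net(torch.cat([s_next, z], dim=-1)).squeeze(-1).clamp(1e-6, 1 - 1e-6)
        reward = torch.log(d) - torch.log(1.0 - d)                         # log(D / (1 - D))
        q_next = Q_tgt(torch.cat([s_next, a_next, z], dim=-1)).squeeze(-1)
        target = reward + gamma * q_next
    return ((q - target) ** 2).mean()
```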
Adding the imitation return $Q$ as a regularization term to the objective $(3)$ gives the FB-CPR actor loss:
$$\mathcal{L}_{actor}(\phi) = \mathbb{E}_{\substack{s\sim D,\,z\sim\upsilon\\a\sim\pi_{\phi}(\cdot|s,z)}}\left[F_{\theta}(s,a,z)^Tz+\alpha Q_{\eta}(s,a,z)\right] \tag{4}$$
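A sketch of the actor update, assuming the policy produces differentiable (reparameterized) actions and that the objective (4) is maximized, so the returned loss is its negative; the value of $\alpha$ is a placeholder.

```python
import torch

def actor_loss(F_net, Q_net, policy, s, z, alpha=0.1):
    """Maximize F_theta(s,a,z)^T z + alpha * Q_eta(s,a,z) over a ~ pi_phi(.|s,z)."""
    a = policy(s, z)                                       # assumed reparameterized sample
    F = F_net(torch.cat([s, a, z], dim=-1))                # (n, d)
    q_fb   = (F * z).sum(dim=-1)                           # F^T z
    q_imit = Q_net(torch.cat([s, a, z], dim=-1)).squeeze(-1)
    return -(q_fb + alpha * q_imit).mean()                 # negate for gradient descent
```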
In summary, we have defined the four main loss functions:

$\mathcal{L}_{FB}(\theta,\omega)$: optimizes $F_{\theta}(s,a,z), B_{\omega}(s)$

$\mathcal{L}_{discriminator}(\psi)$: optimizes the discriminator $D_{\psi}(s,z)$

$\mathcal{L}_{critic}(\eta)$: optimizes the imitation-return estimate $Q_{\eta}(s,a,z)$

$\mathcal{L}_{actor}(\phi)$: optimizes the policy $\pi_{\phi}(\cdot|s,z)$
Training Procedure

1. Online data collection

Suppose we maintain an online training buffer $\mathcal{D}_{online}$ and an unlabeled expert dataset $\mathcal{M}$ that provides high-quality trajectories. The latent used for data collection is drawn from these two sources with probabilities $\tau_{online}$ and $\tau_{unlabeled}$ respectively (and otherwise from the prior); each rollout has length $T$.

A randomly drawn $z$ determines the policy $\pi_{\phi}(\cdot,z)$, which is rolled out to collect a trajectory that is added to $\mathcal{D}_{online}$. The latent $z$ can come from three different sources:
$$z=\begin{cases}
B(s), & s\sim \mathcal{D}_{online}, & \text{with probability } \tau_{online},\\
\frac{1}{T}\sum_{t=1}^TB(s_t), & \{s_1,\cdots,s_{T}\}\sim \mathcal{M}, & \text{with probability } \tau_{unlabeled},\\
z\sim\mathcal{N}(0,I_d), & & \text{with probability } 1-\tau_{online}-\tau_{unlabeled}.
\end{cases}$$

$$z\gets\sqrt{d}\,\frac{z}{\|z\|_2},\quad \text{then roll out } \pi_{\phi}(\cdot,z) \text{ for } T \text{ steps and store the data in } \mathcal{D}_{online}.$$
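A sketch of this mixed sampling rule, with the mixture probabilities as placeholders and the illustrative B_net from earlier; the result is projected onto the $\sqrt{d}$-sphere as above.

```python
import torch

@torch.no_grad()
def sample_z(B_net, online_states, expert_traj, d, tau_online=0.25, tau_unlabeled=0.25):
    """Draw z from the online-buffer encoding, an expert-trajectory encoding, or the Gaussian prior."""
    u = torch.rand(())
    if u < tau_online:                        # z = B(s), s from the online buffer (a (n, s_dim) tensor)
        idx = torch.randint(online_states.shape[0], ())
        z = B_net(online_states[idx])
    elif u < tau_online + tau_unlabeled:      # z = mean_t B(s_t) over an expert trajectory
        z = B_net(expert_traj).mean(dim=0)
    else:                                     # z ~ N(0, I_d)
        z = torch.randn(d)
    return (d ** 0.5) * z / z.norm()          # project onto the sphere of radius sqrt(d)
```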
2. Trajectory sampling and encoding of expert policies

3. Compute the discriminator loss

4. Resample latent features for the online data

5. Compute the FB, regularization, and policy losses and update the networks