Off-Policy

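Definition

Off-policy (Chinese: 离策略, also 离线策略): in reinforcement learning, a method that learns or evaluates one policy (the target policy) from data collected by a different policy (the behavior policy). It is typical of algorithms that reuse historical data, sample from a replay buffer, or explore with behavior that differs from the policy being learned. (In other contexts the term can also loosely mean "deviating from an established policy or guideline", but the reinforcement learning sense is by far the most common.)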
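As an illustration of the definition, here is a minimal sketch of tabular Q-learning, a standard off-policy method. Everything in it is an assumption made for illustration: the five-state chain environment, the constants, and the helper names `step`, `collect`, and `q_learning` are hypothetical. The behavior policy picks actions uniformly at random and logs them to a replay buffer, while the update bootstraps from the greedy action, so the policy being learned is not the one that generated the data.

```python
import random

N_STATES = 5        # hypothetical toy chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)    # 0 = move left, 1 = move right
GAMMA = 0.9         # discount factor (assumed value)
ALPHA = 0.5         # learning rate (assumed value)

def step(state, action):
    """Toy deterministic dynamics: reward 1.0 only on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def collect(num_episodes, rng):
    """Behavior policy: uniformly random actions, logged to a replay buffer."""
    buffer = []
    for _ in range(num_episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            buffer.append((state, action, reward, next_state, done))
            state = next_state
    return buffer

def q_learning(buffer, num_updates, rng):
    """Learn Q for the greedy target policy from the logged transitions."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(num_updates):
        s, a, r, s2, done = rng.choice(buffer)
        # The max over next actions evaluates the greedy target policy,
        # regardless of which action the behavior policy took next.
        target = r if done else r + GAMMA * max(q[s2])
        q[s][a] += ALPHA * (target - q[s][a])
    return q

rng = random.Random(0)
buffer = collect(200, rng)
q = q_learning(buffer, 20000, rng)
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

Although the data come from a random walk, the greedy policy read off the learned table should move right in every non-terminal state, which is the shortest path to the reward in this toy chain.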

Pronunciation (IPA)

US: /ˌɔːf ˈpɑːləsi/; UK: /ˌɒf ˈpɒləsi/

Examples

We trained the agent with off-policy data from old logs.

Off-policy learning can be more sample-efficient, but it often needs techniques such as importance sampling to correct the bias introduced by the mismatch between the behavior and target policies.
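To make the role of importance sampling concrete, the following is a minimal sketch of ordinary importance sampling for off-policy evaluation in a two-armed bandit. The setup is assumed purely for illustration: the probabilities in `BEHAVIOR` and `TARGET`, the reward table, and the helper `logged_data` are hypothetical. Each logged reward is reweighted by the ratio of target to behavior action probabilities, which removes the bias of the naive average at the cost of higher variance.

```python
import random

BEHAVIOR = (0.5, 0.5)   # behavior policy: probability of pulling arm 0 / arm 1
TARGET = (0.1, 0.9)     # target policy whose value we want to estimate
REWARD = (0.0, 1.0)     # hypothetical deterministic reward for each arm

def logged_data(num_samples, rng):
    """Sample (action, reward) pairs by following the behavior policy."""
    data = []
    for _ in range(num_samples):
        action = 0 if rng.random() < BEHAVIOR[0] else 1
        data.append((action, REWARD[action]))
    return data

rng = random.Random(0)
data = logged_data(100_000, rng)

# Naive average: estimates the value of the behavior policy, not the target.
naive_estimate = sum(r for _, r in data) / len(data)

# Ordinary importance sampling: weight each reward by pi(a) / b(a).
is_estimate = sum(TARGET[a] / BEHAVIOR[a] * r for a, r in data) / len(data)

# Exact value of the target policy, available here because the toy is known.
true_value = sum(p * r for p, r in zip(TARGET, REWARD))
```

In this toy setting the naive average should land near 0.5, the value of the behavior policy, while the importance-sampled estimate should land near the target policy's value of 0.9.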

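Etymology

The prefix off- means "away from" or "outside of", and policy means "strategy". In reinforcement learning, "off-policy" stresses that the policy being learned differs from the policy that generated the data: the learner steps "off" the policy it is learning in order to collect data or learn.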

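Literary Works

Reinforcement Learning: An Introduction (Sutton & Barto): a systematic treatment of on-policy and off-policy methods, with importance sampling as one of the core tools.
Human-level control through deep reinforcement learning (Mnih et al., Nature 2015): the deep Q-network (DQN) uses experience replay and is a typical off-policy training framework.
Deterministic Policy Gradient Algorithms (Silver et al., 2014) and the later DDPG papers: off-policy learning combined with a replay buffer for continuous-action control.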
