The difference between on-policy and off-policy learning

On-Policy Learning

Learns about the policy it is currently using to make decisions.

The same policy is used for both:

  • Acting (exploring the environment)

  • Learning (updating value functions)

Example: SARSA

  • Updates Q-values using the action that was actually taken (see the sketch after this list).

  • Learns the value of the \(\varepsilon\)-greedy policy it follows.
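
To make this concrete, here is a minimal tabular sketch of one SARSA step in Python. The table layout, the \(\varepsilon\)-greedy helper, and the hyperparameters (alpha, gamma, eps) are illustrative assumptions, not any particular library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, eps=0.1):
    """One policy for both acting and learning: explore with
    probability eps, otherwise take the greedy action in state s."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: the target uses Q[s_next, a_next], where
    a_next is the action the agent will actually take next under the
    same eps-greedy policy."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```

Because `a_next` is drawn from the same \(\varepsilon\)-greedy policy that generated `a`, SARSA learns the value of the policy it actually follows.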

Off-Policy Learning

Learns about one policy while following another.

Uses:

  • A behavior policy for exploration (e.g., \(\varepsilon\)-greedy)

  • A target policy for learning (e.g., greedy)

Example: Q-learning

  • Even if the agent explores randomly, it updates Q-values based on the best possible action (max over Q), as if it were following a greedy policy (see the sketch below).
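
For contrast, a minimal sketch of the Q-learning update under the same assumed tabular setup as the SARSA sketch above. Only the target changes: `np.max(Q[s_next])` replaces `Q[s_next, a_next]`.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: the target takes the max over all actions
    in s_next, as if a greedy target policy were followed, regardless
    of which action the exploratory behavior policy picks next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

This is why the behavior policy (\(\varepsilon\)-greedy) and the target policy (greedy) can differ: the action actually taken next never enters the update.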

Example:

🎯 Imagine you’re learning to play chess.

🧑 On-policy (SARSA-like):

  • You’re learning by watching yourself play, including your mistakes and explorations.
  • You play games using a mix of good and random moves (e.g., trying new strategies).
  • You update your knowledge based on what you actually did in each game.
  • You get better at playing the way you currently play, gradually improving it.

✅ On-policy:

You learn the value of the policy you’re following.

👀 Off-policy (Q-learning-like):

  • You’re learning how a grandmaster would play, while still playing your own games.
  • You still explore (e.g., try risky moves), but…
  • When updating your knowledge, you ask what the best move would have been, not the one you actually chose.
  • You get better at playing like the grandmaster, even though you’re not playing like them yet.

✅ Off-policy:

You learn the value of a different policy (typically the greedy one), while following a more exploratory behavior policy.

Notes

  • Reinforcement learning is a framework for learning how to interact with an environment from experience.
  • The agent takes actions to interact with the environment.
  • The big challenge in RL is designing a policy that decides which action to take in a given state s so as to maximize future reward.
  • Q(s, a) tells us the quality of being in state s and taking action a. Once I find myself in state s, I just look across all the actions and pick the one with the best quality; if I keep doing that, I will maximize my value (see the sketch below).
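
A minimal sketch of that greedy lookup, with made-up table sizes and values as placeholders:

```python
import numpy as np

n_states, n_actions = 10, 4                                 # assumed sizes, for illustration
Q = np.random.default_rng(0).random((n_states, n_actions))  # stand-in for learned Q-values

s = 3                                        # suppose we find ourselves in state s
best_action = int(np.argmax(Q[s]))           # scan all actions, pick the highest quality
print(f"In state {s}, the greedy action is {best_action}")
```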