The difference between on-policy and off-policy learning

On-Policy Learning

Learns about the policy it is currently using to make decisions.

The same policy is used for both:

  • Acting (exploring the environment)

  • Learning (updating value functions)

Example: SARSA

  • Updates Q-values using the action that was actually taken (see the sketch after this list).

  • Learns the value of the \(\varepsilon\)-greedy policy it follows.
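
To make this concrete, here is a minimal tabular sketch of one SARSA step in Python. The table layout, the \(\varepsilon\)-greedy helper, and the hyperparameters (alpha, gamma, eps) are illustrative assumptions, not any particular library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, eps=0.1):
    """One policy for both acting and learning: explore with
    probability eps, otherwise take the greedy action in state s."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: the target uses Q[s_next, a_next], where
    a_next is the action the agent will actually take next under the
    same eps-greedy policy."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```

Because `a_next` is drawn from the same \(\varepsilon\)-greedy policy that generated `a`, SARSA learns the value of the policy it actually follows.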

Off-Policy Learning

Learns about one policy while following another.

Uses:

  • A behavior policy for exploration (e.g., \(\varepsilon\)-greedy)

  • A target policy for learning (e.g., greedy)

Example: Q-learning

  • Even if the agent explores randomly, it updates Q-values based on the best possible action (max over Q), as if it were following a greedy policy (see the sketch below).
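
For contrast, a minimal sketch of the Q-learning update under the same assumed tabular setup as the SARSA sketch above. Only the target changes: `np.max(Q[s_next])` replaces `Q[s_next, a_next]`.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: the target takes the max over all actions
    in s_next, as if a greedy target policy were followed, regardless
    of which action the exploratory behavior policy picks next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

This is why the behavior policy (\(\varepsilon\)-greedy) and the target policy (greedy) can differ: the action actually taken next never enters the update.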

Example:

🎯 Imagine you’re learning to play chess.

🧑 On-policy (SARSA-like):

  • You’re learning by watching yourself play, including your mistakes and explorations.
  • You play games using a mix of good and random moves (e.g., trying new strategies).
  • You update your knowledge based on what you actually did in each game.
  • You get better at playing the way you currently play, gradually improving it.

✅ On-policy:

You learn the value of the policy you’re following.

👀 Off-policy (Q-learning-like):

  • You’re learning how a grandmaster would play, while still playing your own games.
  • You still explore (e.g., try risky moves), but…
  • When updating your knowledge, you ask what the best move would have been, not the one you actually chose.
  • You get better at playing like the grandmaster, even though you’re not playing like them yet.

✅ Off-policy:

You learn the value of a different policy (typically the greedy one), while following a more exploratory behavior policy.

Notes

  • Reinforcement learning is a framework for learning how to interact with an environment from experience.
  • The agent takes actions to interact with the environment.
  • The big challenge in RL is designing a policy that decides which action to take in a given state s so as to maximize future reward.
  • Q(s, a) tells us the quality of being in state s and taking action a. Once I find myself in state s, I just look across all the actions and pick the one with the best quality; if I keep doing that, I will maximize my value (see the sketch below).
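
A minimal sketch of that greedy lookup, with made-up table sizes and values as placeholders:

```python
import numpy as np

n_states, n_actions = 10, 4                                 # assumed sizes, for illustration
Q = np.random.default_rng(0).random((n_states, n_actions))  # stand-in for learned Q-values

s = 3                                        # suppose we find ourselves in state s
best_action = int(np.argmax(Q[s]))           # scan all actions, pick the highest quality
print(f"In state {s}, the greedy action is {best_action}")
```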