The agent is the learner and decision maker.
A policy defines the learning agent’s way of behaving at a given time.
The reward signal indicates what is good in an immediate sense, right after each decision is made.
A value function specifies what is good in the long run.
The agent–environment interaction in a Markov decision process gives rise to a sequence that begins like this:
\[ S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots, S_t, A_t, R_{t+1}, \dots \]
The agent seeks to maximize the expected return, \(G_t\), which in the simplest (undiscounted) case is defined as: \[
G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T
\]
Discounting is used to prioritize immediate rewards over future rewards:
\[ G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]
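As a quick illustration, here is a minimal Python sketch (not from the source text) that computes a discounted return by accumulating rewards backwards; the reward list and \(\gamma\) value are made up.

```python
# Minimal sketch: compute G_t = sum_k gamma^k * R_{t+k+1} from a list of
# rewards [R_{t+1}, R_{t+2}, ..., R_T]. The rewards below are made-up values.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):     # accumulate from the end of the episode
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.9^2*2 = 2.62
```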
A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy \(\pi\) at time \(t\), then \(\pi(a \mid s)\) is the probability that \(A_t = a\) if \(S_t = s\)
Value functions are functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
Formally, \(v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]\), where \(\mathbb{E}_\pi[\cdot]\) denotes the expected value of a random variable given that the agent follows policy \(\pi\), and \(t\) is any time step.
Model-based
Pros: the agent can plan ahead using its model of the environment, which typically makes learning more sample-efficient.
Cons: it requires an accurate model of the environment's dynamics, and planning with the model can be computationally expensive.
Model-free
Pros: no model of the environment is needed; the agent learns values or policies directly from experience.
Cons: typically less sample-efficient, since everything must be learned from trial-and-error interaction.
Conditional probability of the next state given the current state and action taken: \[ p(s' \mid s, a) \doteq \Pr \{ S_t = s' \mid S_{t-1} = s, A_{t-1} = a \}, \quad \text{for all } s', s \in S \text{ and } a \in A(s) \]
The function \(p\) defines the dynamics of the MDP.
For each choice of \(s\) and \(a\), the dynamics function \(p\) specifies a probability distribution over next states, that is: \[ \sum_{s'\in S}p(s'\mid s, a) = 1, \text{ for all } s \in S, a \in A(s) \]
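To make the dynamics function \(p\) concrete, here is a tiny hypothetical MDP written as a Python dictionary. The states, actions, rewards, and probabilities are invented for illustration; the same `(next_state, reward, probability)` layout is assumed by the code sketches later in these notes.

```python
# Hypothetical two-state MDP: dynamics[(s, a)] lists (s', r, p(s', r | s, a)).
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "move"): [("s0", 0.0, 1.0)],
}

# p specifies a probability distribution for each (s, a): the probabilities
# of the possible outcomes must sum to 1.
assert all(abs(sum(p for _, _, p in outcomes) - 1.0) < 1e-9
           for outcomes in dynamics.values())
```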
\[ \begin{aligned} v_\pi(s) &\doteq \mathbb{E}_\pi \big[G_t \mid S_t = s \big] \\ &= \mathbb{E}_\pi \big[R_{t+1} + \gamma G_{t+1} \mid S_t = s \big] \\ &= \mathbb{E}_\pi \big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \big] \\ &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]. \end{aligned} \]
An optimal policy is a policy that achieves the maximum expected return from any initial state.
The optimal state-value function \(v_*\) is the maximum value function over all policies: \[ v_*(s) = \max_\pi v_\pi(s), \text{ for all } s \in S \]
The optimal action-value function \(q_*\) is the maximum action-value function over all policies: \[ q_*(s, a) = \max_\pi q_\pi(s, a), \text{ for all } s \in S, a \in A(s) \]
We use Dynamic Programming (DP) to leverage value functions in the search for good policies.
The Bellman optimality equation for \(v_*\) is: \[ \begin{aligned} v_*(s) &= \max_a \mathbb{E} \big[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_{t} = s, A_{t} = a \big] \\ &= \max_a \sum_{s',r}p(s',r \mid s,a) \big[r+\gamma v_*(s')\big]. \end{aligned} \]
The Bellman optimality equation for \(q_*\) is: \[ q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma \max_{a'} q_*(s', a') \big] \]
Policy evaluation is the computation of the state-value function \(v_\pi\) for an arbitrary policy \(\pi\). We also refer to it as the *prediction problem*. \[ \begin{aligned} v_\pi(s) &\doteq \mathbb{E}_\pi \big[G_t \mid S_t = s \big] \\ &= \mathbb{E}_\pi \big[R_{t+1} + \gamma G_{t+1} \mid S_t = s \big] \\ &= \mathbb{E}_\pi \big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \big] \\ &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]. \end{aligned} \]
where \(\pi(a|s)\) is the probability of taking action \(a\) in state \(s\) under policy \(\pi\), and the expectations are subscripted by \(\pi\) to indicate that they are conditional on \(\pi\) being followed.
Consider a sequence of approximate value functions \(v_0, v_1, v_2, \dots\). Each successive approximation is obtained by using the Bellman equation for \(v_\pi\) as an update rule:
\[ \begin{aligned} v_{k+1}(s) &\doteq \mathbb{E}_\pi \big[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s \big] \\ &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \big[ r + \gamma v_k(s') \big]. \end{aligned} \]
The sequence \(\{v_k\}\) can be shown in general to converge to \(v_\pi\) as \(k \rightarrow \infty\) under the same conditions that guarantee the existence of \(v_\pi\). This algorithm is called iterative policy evaluation.
Iterative Policy Evaluation, for estimating \(V \approx v_\pi\)
Input:
\(\pi\), the policy to be evaluated
Algorithm parameter: a small threshold \(\theta > 0\) determining accuracy of estimation
Initialize \(V(s)\) arbitrarily, for \(s \in S\), and \(V(\text{terminal}) = 0\)
Loop:
\(\Delta \leftarrow 0\)
Loop for each \(s \in S\):
\(v \leftarrow V(s)\)
\(V(s) \leftarrow \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V(s')]\)
\(\Delta \leftarrow \max(\Delta, |v - V(s)|)\)
until \(\Delta < \theta\)
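Below is a minimal Python sketch of iterative policy evaluation, assuming the hypothetical `dynamics[(s, a)]` structure from above and a stochastic policy stored as `policy[s][a]`, the probability \(\pi(a \mid s)\); it is an illustration, not the only way to implement the algorithm.

```python
# Minimal sketch of iterative policy evaluation (assumed data structures:
# dynamics[(s, a)] = [(next_state, reward, probability), ...] and
# policy[s] = {action: pi(a | s)}). States left out of `states` (e.g. the
# terminal state) implicitly keep value 0 via V.get(..., 0.0).
def iterative_policy_evaluation(states, dynamics, policy, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}                      # arbitrary initialization
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = sum(pi_a * sum(p * (r + gamma * V.get(s2, 0.0))
                                  for s2, r, p in dynamics[(s, a)])
                       for a, pi_a in policy[s].items())
            delta = max(delta, abs(v - V[s]))
        if delta < theta:                             # sweep changed V very little
            return V
```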
Once we have the value function \(v_\pi\) for a policy, we would like to know whether it would be better to deviate from \(\pi\) in some state \(s\) by choosing an action \(a \ne \pi(s)\). One way to answer this question is to consider selecting \(a\) in \(s\) and thereafter following the existing policy \(\pi\).
This leads to the definition of the q-value of a state-action pair:
\[ \begin{aligned} q_\pi(s, a) &\doteq \mathbb{E} \big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \big] \\ &= \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]. \end{aligned} \]
The policy improvement theorem states that if we improve the policy by acting greedily with respect to \(q_\pi\), the new policy \(\pi'\) will be at least as good as \(\pi\).
Formally, if
\[ \begin{aligned} \pi'(s) &= \arg\max_a q_\pi(s, a) \\ &= \arg \max_a \mathbb{E} \big[R_{t+1}+\gamma v_\pi(S_{t+1}) \mid S_t=s, A_t=a \big] \\ &= \arg \max_a \sum_{s',r}p(s',r \mid s, a) \big[r+\gamma v_\pi(s') \big] \end{aligned} \]
then
\[ v_{\pi'}(s) \geq v_\pi(s). \]
for all \(s \in S\).
Once a policy, \(\pi\), has been improved using \(v_\pi\) to yield a better policy, \(\pi'\), we can then compute \(v_{\pi'}\) and improve it again to yield an even better \(\pi''\).
\[ \pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_* \]
where \(\xrightarrow{E}\) denotes a policy evaluation and \(\xrightarrow{I}\) denotes a policy improvement. This way of finding an optimal policy is called policy iteration.
A complete policy iteration algorithm
Policy Iteration (using iterative policy evaluation) for estimating \(\pi \approx \pi_*\)
1. Initialization
\(V(s) \in \mathbb{R}\) and \(\pi(s) \in A(s)\) arbitrarily for all \(s \in S\)
\(V(\text{terminal}) \doteq 0\)
2. Policy Evaluation
Loop:
\(\Delta \leftarrow 0\)
Loop for each \(s \in S\):
\(v \leftarrow V(s)\)
\(V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s)) \left[r + \gamma V(s')\right]\)
\(\Delta \leftarrow \max(\Delta, |v - V(s)|)\)
Until \(\Delta < \theta\) (a small positive number determining the accuracy of estimation)
3. Policy Improvement
policy-stable \(\leftarrow\) true
For each \(s \in S\):
old-action \(\leftarrow \pi(s)\)
\(\pi(s) \leftarrow \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right]\)
If old-action \(\ne \pi(s)\), then policy-stable \(\leftarrow\) false
If policy-stable, then stop and return \(V \approx v_*\) and \(\pi \approx \pi_*\);
Else go to step 2
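A minimal Python sketch of the whole loop, again assuming the hypothetical `dynamics[(s, a)]` structure and storing a deterministic policy as `pi[s]`:

```python
# Minimal sketch of policy iteration with a deterministic tabular policy.
# Assumed structures: dynamics[(s, a)] = [(next_state, reward, prob), ...],
# actions[s] = list of actions available in s.
def policy_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    pi = {s: actions[s][0] for s in states}           # arbitrary initial policy

    def q(s, a):                                      # one-step lookahead value
        return sum(p * (r + gamma * V.get(s2, 0.0))
                   for s2, r, p in dynamics[(s, a)])

    while True:
        # 2. Policy evaluation (for the current deterministic policy)
        while True:
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = q(s, pi[s])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # 3. Policy improvement (act greedily with respect to V)
        policy_stable = True
        for s in states:
            old_action = pi[s]
            pi[s] = max(actions[s], key=lambda a: q(s, a))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi
```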
Value iteration truncates policy evaluation to a single sweep of the states and combines it with policy improvement in one update: \[ \begin{aligned} v_{k+1}(s) &\doteq \max_a \mathbb{E} \big[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t=s, A_t = a \big] \\ &=\max_a \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_k(s') \big] \end{aligned} \]
for all \(s \in S\).
Value Iteration, for estimating \(\pi \approx \pi_*\)
Algorithm parameter:
A small threshold \(\theta > 0\) determining the accuracy of estimation
Initialization:
Initialize \(V(s)\) arbitrarily for all \(s \in S^+\), except that \(V(\text{terminal}) = 0\)
Loop:
\(\Delta \leftarrow 0\)
Loop for each \(s \in S\):
\(v \leftarrow V(s)\)
\(V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right]\)
\(\Delta \leftarrow \max(\Delta, |v - V(s)|)\)
until \(\Delta < \theta\)
Output:
A deterministic policy, \(\pi \approx \pi_*\), such that
\(\pi(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right]\)
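A minimal Python sketch of value iteration under the same assumed `dynamics[(s, a)]` structure; the last step extracts the deterministic greedy policy.

```python
# Minimal sketch of value iteration (same assumed structures as above).
def value_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}

    def q(s, a):
        return sum(p * (r + gamma * V.get(s2, 0.0))
                   for s2, r, p in dynamics[(s, a)])

    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = max(q(s, a) for a in actions[s])   # single greedy sweep
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Output: deterministic policy that is greedy with respect to V
    pi = {s: max(actions[s], key=lambda a: q(s, a)) for s in states}
    return V, pi
```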
Generalized Policy Iteration (GPI) is a general idea that describes how two processes, evaluating a policy and improving it, work together and influence each other.
Most reinforcement learning algorithms use a value function to judge how good a policy is, and then update the policy based on that judgment.
If both the value estimate and the policy stabilize (i.e., stop changing), then the policy is greedy with respect to its own value function, which means it satisfies the Bellman optimality equation and is therefore optimal.

The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component.
MC methods solve reinforcement learning problems by averaging sample returns from complete episodes of experience: sequences of states, actions, and rewards.
They do not require knowledge of the environment’s dynamics, making them powerful for learning from real or simulated experiences.
First-Visit Monte Carlo Prediction (for estimating \(V \approx v_\pi\))
Input:
A policy \(\pi\) to be evaluated
Initialize:
\(V(s) \in \mathbb{R}\) arbitrarily, for all \(s \in S\)
\(\text{Returns}(s) \leftarrow\) an empty list, for all \(s \in S\)
Loop forever (for each episode):
Generate an episode following \(\pi\): \(S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{T-1}, A_{T-1}, R_T\)
\(G \leftarrow 0\)
Loop for each step of the episode, \(t = T-1, T-2, \dots, 0\):
\(G \leftarrow \gamma G + R_{t+1}\)
Unless \(S_t\) appears in \(S_0, S_1, \dots, S_{t-1}\):
Append \(G\) to \(\text{Returns}(S_t)\)
\(V(S_t) \leftarrow \text{average}(\text{Returns}(S_t))\)
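A minimal Python sketch of first-visit MC prediction; `generate_episode` is an assumed callable that follows \(\pi\) and returns a list of `(state, action, reward)` tuples, where the reward in position `t` plays the role of \(R_{t+1}\).

```python
from collections import defaultdict

# Minimal sketch of first-visit Monte Carlo prediction.
# generate_episode() is an assumed callable returning [(S_t, A_t, R_{t+1}), ...].
def first_visit_mc_prediction(generate_episode, num_episodes, gamma=0.9):
    V = defaultdict(float)
    returns = defaultdict(list)
    for _ in range(num_episodes):
        episode = generate_episode()
        visited = [s for s, _, _ in episode]
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):     # iterate backwards
            s, _, r = episode[t]
            g = gamma * g + r
            if s not in visited[:t]:                  # first visit to s only
                returns[s].append(g)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```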
Alternating complete steps of policy evaluation and policy improvement are performed, beginning with an arbitrary policy \(\pi_0\) and ending with the optimal policy and optimal action-value function:
\[ \pi_0 \xrightarrow{E} q_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} q_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} q_* \]
Monte Carlo ES (Exploring Starts), for estimating \(\pi \approx \pi_*\)
Initialize:
\(\pi(s) \in A(s)\) arbitrarily, for all \(s \in S\)
\(Q(s, a) \in \mathbb{R}\) arbitrarily, for all \(s \in S, a \in A(s)\)
\(\text{Returns}(s, a) \leftarrow\) empty list, for all \(s \in S, a \in A(s)\)
Loop forever (for each episode):
Choose \(S_0 \in S, A_0 \in A(S_0)\) randomly, such that all pairs have probability \(> 0\)
Generate an episode from \(S_0, A_0\), following \(\pi\):
\(S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T\)
\(G \leftarrow 0\)
Loop for each step of episode, \(t = T-1, T-2, \dots, 0\):
\(G \leftarrow \gamma G + R_{t+1}\)
Unless the pair \((S_t, A_t)\) appears in
\(S_0, A_0, S_1, A_1, \dots, S_{t-1}, A_{t-1}\):
Append \(G\) to \(\text{Returns}(S_t, A_t)\)
\(Q(S_t, A_t) \leftarrow \text{average}(\text{Returns}(S_t, A_t))\)
\(\pi(S_t) \leftarrow \arg\max_a Q(S_t, a)\)
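A minimal Python sketch of Monte Carlo ES; `generate_episode(s0, a0, pi)` is an assumed callable that starts from the given state–action pair, follows `pi` afterwards, and returns `(state, action, reward)` tuples, and `states`/`actions` are assumed to be lists.

```python
import random
from collections import defaultdict

# Minimal sketch of Monte Carlo ES (exploring starts).
def mc_exploring_starts(states, actions, generate_episode, num_episodes, gamma=0.9):
    Q = defaultdict(float)
    returns = defaultdict(list)
    pi = {s: random.choice(actions[s]) for s in states}   # arbitrary initial policy
    for _ in range(num_episodes):
        s0 = random.choice(states)                    # exploring start: every
        a0 = random.choice(actions[s0])               # (s, a) pair has p > 0
        episode = generate_episode(s0, a0, pi)
        pairs = [(s, a) for s, a, _ in episode]
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            g = gamma * g + r
            if (s, a) not in pairs[:t]:               # first visit to (s, a)
                returns[(s, a)].append(g)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                pi[s] = max(actions[s], key=lambda b: Q[(s, b)])   # greedy improve
    return Q, pi
```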
On-policy First-Visit MC Control (for \(\varepsilon\)-soft policies), estimates \(\pi \approx \pi_*\)
Algorithm parameter:
Small \(\varepsilon > 0\)
Initialize:
\(\pi \leftarrow\) an arbitrary \(\varepsilon\)-soft policy
\(Q(s, a) \in \mathbb{R}\) arbitrarily, for all \(s \in S, a \in A(s)\)
\(\text{Returns}(s, a) \leftarrow\) empty list, for all \(s \in S, a \in A(s)\)
Repeat forever (for each episode):
Generate an episode following \(\pi\): \(S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T\)
\(G \leftarrow 0\)
Loop for each step of episode, \(t = T-1, T-2, \dots, 0\):
\(G \leftarrow \gamma G + R_{t+1}\)
Unless the pair \((S_t, A_t)\) appears in \(S_0, A_0, \dots, S_{t-1}, A_{t-1}\):
Append \(G\) to \(\text{Returns}(S_t, A_t)\)
\(Q(S_t, A_t) \leftarrow \text{average}(\text{Returns}(S_t, A_t))\)
\(A^* \leftarrow \arg\max_a Q(S_t, a)\) (ties broken arbitrarily)
For all \(a \in A(S_t)\):
\(\pi(a \mid S_t) \leftarrow \begin{cases}
1 - \varepsilon + \varepsilon / |A(S_t)| & \text{if } a = A^* \\
\varepsilon / |A(S_t)| & \text{if } a \ne A^*
\end{cases}\)
where \(|A(s)|\) is the number of actions available in state \(s\).
The \(\varepsilon\)-greedy policy ensures that all actions are tried, but actions with higher value estimates are tried more frequently. This balances exploration (trying new actions) and exploitation (choosing the best-known action).
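A minimal Python sketch of \(\varepsilon\)-greedy action selection over tabular Q-values, matching the probabilities above: the greedy action ends up chosen with probability \(1 - \varepsilon + \varepsilon / |A(s)|\), every other action with probability \(\varepsilon / |A(s)|\).

```python
import random

# Minimal sketch of epsilon-greedy action selection from a tabular Q.
# With probability eps an action is drawn uniformly (including the greedy one),
# otherwise the greedy action is taken, giving exactly the probabilities above.
def epsilon_greedy(Q, state, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit
```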
Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to \(V(S_t)\) (only then is \(G_t\) known), TD methods need to wait only until the next time step. At time \(t+1\) they immediately form a target and make a useful update using the observed reward \(R_{t+1}\) and the estimate \(V(S_{t+1})\). The simplest TD method makes the update:
\[ V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1})-V(S_t) \big] \]
immediately on transition to \(S_{t+1}\) and receiving \(R_{t+1}\). In effect, the target for the Monte Carlo update is \(G_t\), whereas the target for the TD update is \(R_{t+1} + \gamma V(S_{t+1})\). This TD method is called TD(0), or one-step TD.
Tabular TD(0) for Estimating \(v_\pi\)
Input:
The policy \(\pi\) to be evaluated
Algorithm parameter:
Step size \(\alpha \in (0, 1]\)
Initialize:
\(V(s)\) arbitrarily for all \(s \in S^+\), except that \(V(\text{terminal}) = 0\)
Loop for each episode:
Initialize \(S\)
Loop for each step of episode:
\(A \leftarrow\) action given by \(\pi\) for \(S\)
Take action \(A\), observe \(R\), \(S'\)
\(V(S) \leftarrow V(S) + \alpha \left[ R + \gamma V(S') - V(S) \right]\)
\(S \leftarrow S'\)
Until \(S\) is terminal
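A minimal Python sketch of tabular TD(0); `env` is an assumed object with `reset() -> state` and `step(action) -> (next_state, reward, done)`, and `policy(state)` returns the action chosen by \(\pi\).

```python
from collections import defaultdict

# Minimal sketch of tabular TD(0) prediction for a given policy.
def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.9):
    V = defaultdict(float)                    # unvisited/terminal states stay at 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            # one-step bootstrapped update: target is R + gamma * V(S')
            V[s] += alpha * (r + gamma * V[s2] - V[s])
            s = s2
    return V
```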
For TD control, we consider transitions from state–action pair to state–action pair, and learn the values of state–action pairs:
\[ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[R_{t+1}+\gamma Q(S_{t+1}, A_{t+1})-Q(S_t, A_t) \big] \]
SARSA (On-Policy TD Control), for estimating \(Q \approx q_*\)
Algorithm parameters:
Step size \(\alpha \in (0, 1]\), small \(\varepsilon > 0\)
Initialize:
\(Q(s, a)\) arbitrarily, for all \(s \in S^+\), \(a \in A(s)\)
\(Q(\text{terminal}, \cdot) = 0\)
Loop for each episode:
Initialize \(S\)
Choose \(A\) from \(S\) using a policy derived from \(Q\) (e.g., \(\varepsilon\)-greedy)
Loop for each step of episode:
Take action \(A\), observe \(R\), \(S'\)
Choose \(A'\) from \(S'\) using a policy derived from \(Q\) (e.g., \(\varepsilon\)-greedy)
\(Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]\)
\(S \leftarrow S';\quad A \leftarrow A'\)
Until \(S\) is terminal
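A minimal Python sketch of SARSA with the same assumed `env` interface (`reset()`, `step(action) -> (next_state, reward, done)`); `actions` is assumed to be the list of actions available in every state.

```python
import random
from collections import defaultdict

# Minimal sketch of SARSA (on-policy TD control) with epsilon-greedy behaviour.
def sarsa(env, actions, num_episodes, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                    # Q(terminal, .) stays at 0

    def choose(s):                            # epsilon-greedy policy derived from Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = choose(s2)                   # next action chosen by the same policy
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q
```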
Q-learning is defined by
\[ Q(S_t, A_t) \leftarrow Q(S_t, A_t)+ \alpha \big[R_{t+1}+\gamma \max_aQ(S_{t+1}, a)-Q(S_t, A_t) \big] \]
The Q-learning algorithm is shown below in procedural form.
Q-learning (Off-Policy TD Control), for estimating \(\pi \approx \pi_*\)
Algorithm parameters:
Step size \(\alpha \in (0, 1]\), small \(\varepsilon > 0\)
Initialize:
\(Q(s, a)\) arbitrarily for all \(s \in S^+\), \(a \in A(s)\),
except that \(Q(\text{terminal}, \cdot) = 0\)
Loop for each episode:
Initialize \(S\)
Loop for each step of episode:
Choose \(A\) from \(S\) using a policy derived from \(Q\) (e.g., \(\varepsilon\)-greedy)
Take action \(A\), observe \(R\), \(S'\)
\(Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]\)
\(S \leftarrow S'\)
Until \(S\) is terminal
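A minimal Python sketch of Q-learning with the same assumed `env` interface as the SARSA sketch; the max over next-state action values makes the update off-policy.

```python
import random
from collections import defaultdict

# Minimal sketch of Q-learning (off-policy TD control).
def q_learning(env, actions, num_episodes, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                    # Q(terminal, .) stays at 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # behaviour policy: epsilon-greedy with respect to Q
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            # target uses the greedy action in S', independent of behaviour
            target = r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```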