

29 May 2025
Reinforcement learning is learning what to do—how to map situations to actions to maximize a numerical reward signal.
Model-Free Adaptation: Learns optimal behavior directly from experience
Long-Term Planning: Optimizes decisions over long time horizons
Data-Driven Policies: Automatically refines its policy as more data becomes available
Scalable to Complex Tasks: Can handle very large state-action spaces via function approximation
Agent is the learner and decision maker.
A policy defines the learning agent’s way of behaving at a given time.
The reward signal indicates what is good immediately after each decision.
A value function specifies what is good in the long run.
MDPs (Markov decision processes) are a classical formalization of the sequential decision-making problem addressed by reinforcement learning.
Actions taken now influence not only immediate contributions (e.g. reward, cost, profit), but also future situations (states) and, consequently, future contributions.
MDPs therefore involve delayed contributions and the need to trade off immediate and delayed contributions.
The agent–environment interaction in a decision process gives rise to a trajectory that begins like this:
\[ S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots, S_t, A_t, R_{t+1}, \dots \]
An MDP can be described by the tuple \((S, A, p, r, \gamma)\), where \(S\) is the set of states, \(A\) the set of actions, \(p\) the transition dynamics, \(r\) the reward function, and \(\gamma\) the discount factor.
The function \(p(s' \mid s, a)\) defines the dynamics of the MDP and specifies a probability distribution over next states for each choice of \(s\) and \(a\): \[ \sum_{s'\in S}p(s' \mid s, a) = 1, \text{ for all } s \in S, a \in A(s). \]
The agent seeks to maximize the expected return, \(G_t\), which for an episodic task ending at time \(T\) is defined as: \[
G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T.
\]
Discounting is used to prioritize immediate rewards over future rewards:
\[ G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]
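As a quick illustration, here is a minimal sketch of computing a discounted return from a finite list of rewards; the reward sequence and \(\gamma\) used in the example are made-up values.

```python
# Minimal sketch: computing a discounted return from a finite reward sequence.
# The rewards and gamma below are illustrative values, not from the notes.

def discounted_return(rewards, gamma=0.9):
    """Return sum_k gamma^k * R_{t+k+1} for rewards = [R_{t+1}, R_{t+2}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```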
A policy is a mapping from states to probabilities of selecting each possible action. If the agent is following policy \(\pi\) at time \(t\), then \(\pi(a \mid s)\) is the probability that \(A_t = a\) if \(S_t = s\)
Value functions are functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
The value of a state \(s\) under a policy \(\pi\), denoted \(v_\pi(s)\), is defined as \[ v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t=s] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t=s\right], \text{ for all } s \in S, \]
where \(\mathbb{E}_\pi[\cdot]\) denotes the expected value of a random variable given that the agent follows policy \(\pi\), and \(t\) is any time step.
To solve stochastic sequential decision problems, we have to model the fact that new information becomes available after we make the decision \(a_t\). The result is uncertainty both in the contribution earned and in the next state we visit, \(S_{t+1}\).
We estimate the expected future contributions (cumulative rewards) and add the immediate contribution received from the action taken in the current state following a given policy. This relationship is expressed by the Bellman equation, defined as:
\[ V_t(S_t) = \max_{a_t \in A_t} \left(r_{t+1}(S_t, a_t) + \gamma\, \mathbb{E}\big[V_{t+1}(S_{t+1}) \mid S_t, a_t\big] \right) \]
This equation can be solved recursively, backward or forward in time. Solving it yields the value functions and, at the same time, an optimal policy at time \(t\):
\[ a_t^*(S_t) = \arg\max_{a_t \in A_t} \left(r_{t+1}(S_t, a_t) + \gamma\, \mathbb{E}\big[V_{t+1}(S_{t+1}) \mid S_t, a_t\big] \right). \]
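The sketch below illustrates this backward recursion on a tiny, made-up finite-horizon MDP; the dynamics `P`, rewards `R`, horizon and discount factor are all illustrative assumptions, not taken from the notes.

```python
import numpy as np

# Backward recursion for a tiny finite-horizon MDP (all numbers are illustrative).
# P[a, s, s'] = transition probability, R[s, a] = immediate contribution r(s, a).
n_states, n_actions, T, gamma = 2, 2, 3, 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # action 0
              [[0.5, 0.5], [0.9, 0.1]]])  # action 1
R = np.array([[1.0, 0.0],                 # rewards for state 0
              [0.0, 2.0]])                # rewards for state 1

V = np.zeros((T + 1, n_states))           # terminal values V_T(s) = 0
policy = np.zeros((T, n_states), dtype=int)
for t in range(T - 1, -1, -1):            # solve backward in time
    EV = P @ V[t + 1]                     # EV[a, s] = E[V_{t+1}(S_{t+1}) | s, a]
    Q = R + gamma * EV.T                  # Q[s, a] = r(s, a) + gamma * EV
    V[t] = Q.max(axis=1)                  # Bellman equation
    policy[t] = Q.argmax(axis=1)          # optimal decision a_t^*(s)

print(V[0], policy[0])
```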
Booking Process:
Exclusively one-day rentals, which must be booked the day before.
Bookings arrive randomly in two batches (morning, afternoon) on the previous day.
Each batch has one of four possible demand realizations (# large requests, # small requests).
Booking Controls:
| Demand (# Large requests, # Small requests) | (0,0) | (0,1) | (1,0) | (1,1) |
|---|---|---|---|---|
| Arrival Probability | 3% | 10% | 20% | 67% |
| | Small car | Large car |
|---|---|---|
| Rental price | £30 | £40 |

Goal: Find a booking control policy that maximises the firm’s expected revenues
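As a starting point for modelling this case, the demand distribution and prices from the tables above can be encoded directly; everything else (fleet size, the state definition, the booking-control logic itself) is deliberately left out of this sketch.

```python
import random

# Demand distribution and prices from the tables above; the rest of the MDP
# (fleet size, booking controls, state definition) is not modelled here.
DEMAND = [(0, 0), (0, 1), (1, 0), (1, 1)]   # (# large requests, # small requests)
PROBS  = [0.03, 0.10, 0.20, 0.67]           # arrival probability of each batch type
PRICE  = {"small": 30, "large": 40}         # £ per one-day rental

def sample_batch(rng=random):
    """Draw one batch of requests (morning or afternoon of the previous day)."""
    return rng.choices(DEMAND, weights=PROBS, k=1)[0]

large, small = sample_batch()
print(f"One batch: {large} large and {small} small requests")
```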
Cars can be rented for more than one day
Longer booking periods
More types of cars
Larger fleet
A network of rental stations
Advanced booking
With these extensions, the Bellman equation can no longer be solved exactly.
Policy evaluation is the computation of the state-value function \(v_\pi\) for an arbitrary policy \(\pi\).
Consider a sequence of approximate value functions \(v_0, v_1, v_2, \dots\). Each successive approximation is obtained by using the Bellman equation for \(v_\pi\) as an update rule:
\[ \begin{aligned} v_{k+1}(s) &\doteq \mathbb{E}_\pi \big[G_t \mid S_t = s \big] \\ &= \mathbb{E}_\pi \big[R_{t+1} + \gamma G_{t+1} \mid S_t = s \big] \\ &\doteq \mathbb{E}_\pi \big[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s \big] \\ &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \big[ r + \gamma v_k(s') \big], \end{aligned} \]
where \(k\) is the iteration number.
The sequence \(\{v_k\}\) can be shown, in general, to converge to \(v_\pi\) as \(k \rightarrow \infty\). This algorithm is called iterative policy evaluation.
Iterative Policy Evaluation, for estimating \(V \approx v_\pi\)
Input:
\(\pi\), the policy to be evaluated
Algorithm parameter: a small threshold \(\theta > 0\) determining accuracy of estimation
Initialize \(V(s)\) arbitrarily, for \(s \in S\), and \(V(\text{terminal}) = 0\)
Loop:
\(\Delta \leftarrow 0\)
Loop for each \(s \in S\):
\(v \leftarrow V(s)\)
\(V(s) \leftarrow \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V(s')]\)
\(\Delta \leftarrow \max(\Delta, |v - V(s)|)\)
until \(\Delta < \theta\)
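A minimal Python sketch of this procedure is given below. It assumes the dynamics are stored tabularly as `P[s][a] = [(prob, next_state, reward, done), ...]` (a common convention, e.g. in Gymnasium's toy-text environments) and that `policy[s][a]` holds \(\pi(a \mid s)\); both are assumptions about the data layout, not part of the algorithm itself.

```python
import numpy as np

# Iterative policy evaluation for a tabular MDP.
# P[s][a] is assumed to be a list of (prob, next_state, reward, done) tuples;
# policy[s][a] holds the probability pi(a | s).
def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    n_states = len(P)
    V = np.zeros(n_states)                         # V(terminal) stays at 0
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            V[s] = sum(policy[s][a] * prob * (r + gamma * V[s2] * (not done))
                       for a in range(len(P[s]))
                       for prob, s2, r, done in P[s][a])
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V
```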
We know how good it is to follow the current policy from \(s\)—that is \(v_\pi(s)\)
But would it be better or worse to change to the new policy?
Once a policy, \(\pi\), has been improved using \(v_\pi\) to yield a better policy, \(\pi'\), we can then compute \(v_{\pi'}\) and improve it again to yield an even better \(\pi''\).
\[ \pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_* \]
where \(\xrightarrow{E}\) denotes a policy evaluation and \(\xrightarrow{I}\) denotes a policy improvement. This way of finding an optimal policy is called policy iteration.
A complete policy iteration algorithm
Policy Iteration (using iterative policy evaluation) for estimating \(\pi \approx \pi_*\)
1. Initialization
\(V(s) \in \mathbb{R}\) and \(\pi(s) \in A(s)\) arbitrarily for all \(s \in S\)
\(V(\text{terminal}) \doteq 0\)
2. Policy Evaluation
Loop:
\(\Delta \leftarrow 0\)
Loop for each \(s \in S\):
\(v \leftarrow V(s)\)
\(V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s)) \left[r + \gamma V(s')\right]\)
\(\Delta \leftarrow \max(\Delta, |v - V(s)|)\)
Until \(\Delta < \theta\) (a small positive number determining the accuracy of estimation)
3. Policy Improvement
policy-stable \(\leftarrow\) true
For each \(s \in S\):
old-action \(\leftarrow \pi(s)\)
\(\pi(s) \leftarrow \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right]\)
If old-action \(\ne \pi(s)\), then policy-stable \(\leftarrow\) false
If policy-stable, then stop and return \(V \approx v_*\) and \(\pi \approx \pi_*\);
Else go to step 2
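The sketch below mirrors this algorithm, reusing the `policy_evaluation` function from the previous sketch and the same assumed `P[s][a]` layout. The policy is stored as a probability table so that evaluation and improvement share one representation.

```python
import numpy as np

# Policy iteration, reusing policy_evaluation() and the assumed dynamics format
# P[s][a] = [(prob, next_state, reward, done), ...].
def q_from_v(P, V, s, gamma):
    """One-step lookahead: Q(s, a) for every action a, given the current V."""
    return np.array([sum(prob * (r + gamma * V[s2] * (not done))
                         for prob, s2, r, done in P[s][a])
                     for a in range(len(P[s]))])

def policy_iteration(P, gamma=0.9, theta=1e-8):
    n_states, n_actions = len(P), len(P[0])
    policy = np.ones((n_states, n_actions)) / n_actions   # start from a random policy
    while True:
        V = policy_evaluation(P, policy, gamma, theta)    # policy evaluation
        new_policy = np.zeros_like(policy)
        for s in range(n_states):
            new_policy[s, np.argmax(q_from_v(P, V, s, gamma))] = 1.0  # greedy improvement
        if np.array_equal(new_policy, policy):            # policy-stable
            return V, new_policy
        policy = new_policy
```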
Value iteration combines the policy improvement and (truncated) policy evaluation steps into a single update: \[ \begin{aligned} v_{k+1}(s) &\doteq \max_a \mathbb{E} \big[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t=s, A_t = a \big] \\ &= \max_a \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_k(s') \big], \end{aligned} \]
for all \(s \in S\).
Value Iteration, for estimating \(\pi \approx \pi_*\)
Algorithm parameter:
A small threshold \(\theta > 0\) determining the accuracy of estimation
Initialization:
Initialize \(V(s)\) arbitrarily for all \(s \in S^+\), except that \(V(\text{terminal}) = 0\)
Loop:
\(\Delta \leftarrow 0\)
Loop for each \(s \in S\):
\(v \leftarrow V(s)\)
\(V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right]\)
\(\Delta \leftarrow \max(\Delta, |v - V(s)|)\)
until \(\Delta < \theta\)
Output:
A deterministic policy, \(\pi \approx \pi_*\), such that
\(\pi(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right]\)
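A corresponding sketch of value iteration, under the same assumed `P[s][a]` layout as the previous sketches:

```python
import numpy as np

# Value iteration, under the same assumed dynamics format
# P[s][a] = [(prob, next_state, reward, done), ...].
def value_iteration(P, gamma=0.9, theta=1e-8):
    n_states, n_actions = len(P), len(P[0])
    V = np.zeros(n_states)

    def backup(s, a):
        """Expected one-step return of action a in state s under the current V."""
        return sum(prob * (r + gamma * V[s2] * (not done))
                   for prob, s2, r, done in P[s][a])

    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            V[s] = max(backup(s, a) for a in range(n_actions))   # max over actions
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Extract the deterministic greedy policy from the converged values.
    policy = np.array([max(range(n_actions), key=lambda a: backup(s, a))
                       for s in range(n_states)])
    return V, policy
```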
The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component.
MC methods solve reinforcement learning problems by averaging sample returns from experienced sequences of states, actions, and rewards.
They do not require knowledge of the environment’s dynamics, making them powerful for learning from real or simulated experiences.
Suppose we wish to estimate \(v_{\pi}(s)\), the value of a state \(s\) under policy \(\pi\), given a set of episodes obtained by following \(\pi\) and passing through \(s\).
First-Visit Monte Carlo Prediction (for estimating \(V \approx v_\pi\))
Input:
A policy \(\pi\) to be evaluated
Initialize:
\(V(s) \in \mathbb{R}\) arbitrarily, for all \(s \in S\)
\(\text{Returns}(s) \leftarrow\) an empty list, for all \(s \in S\)
Loop forever (for each episode):
Generate an episode following \(\pi\): \(S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{T-1}, A_{T-1}, R_T\)
\(G \leftarrow 0\)
Loop for each step of the episode, \(t = T-1, T-2, \dots, 0\):
\(G \leftarrow \gamma G + R_{t+1}\)
Unless \(S_t\) appears in \(S_0, S_1, \dots, S_{t-1}\):
Append \(G\) to \(\text{Returns}(S_t)\)
\(V(S_t) \leftarrow \text{average}(\text{Returns}(S_t))\)
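A compact Python rendering of this procedure follows. The episode generator `generate_episode(policy)` is a hypothetical helper assumed to return a list of `(state, action, reward)` triples for one episode.

```python
from collections import defaultdict

# First-visit Monte Carlo prediction. `generate_episode(policy)` is a
# hypothetical helper returning [(S_0, A_0, R_1), (S_1, A_1, R_2), ...].
def first_visit_mc_prediction(generate_episode, policy, n_episodes=10_000, gamma=0.9):
    returns = defaultdict(list)
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = generate_episode(policy)
        states = [s for s, _, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):   # t = T-1, ..., 0
            s, _, r = episode[t]
            G = gamma * G + r                       # G <- gamma*G + R_{t+1}
            if s not in states[:t]:                 # first visit to s in this episode
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```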
Alternating complete steps of policy evaluation and policy improvement are performed, beginning with an arbitrary policy \(\pi_0\) and ending with the optimal policy and optimal action-value function:
\[ \pi_0 \xrightarrow{E} q_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} q_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} q_* \]
Monte Carlo ES (Exploring Starts), for estimating \(\pi \approx \pi_*\)
Initialize:
\(\pi(s) \in A(s)\) arbitrarily, for all \(s \in S\)
\(Q(s, a) \in \mathbb{R}\) arbitrarily, for all \(s \in S, a \in A(s)\)
\(\text{Returns}(s, a) \leftarrow\) empty list, for all \(s \in S, a \in A(s)\)
Loop forever (for each episode):
Choose \(S_0 \in S, A_0 \in A(S_0)\) randomly, such that all pairs have probability \(> 0\)
Generate an episode from \(S_0, A_0\), following \(\pi\):
\(S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T\)
\(G \leftarrow 0\)
Loop for each step of episode, \(t = T-1, T-2, \dots, 0\):
\(G \leftarrow \gamma G + R_{t+1}\)
Unless the pair \((S_t, A_t)\) appears in
\(S_0, A_0, S_1, A_1, \dots, S_{t-1}, A_{t-1}\):
Append \(G\) to \(\text{Returns}(S_t, A_t)\)
\(Q(S_t, A_t) \leftarrow \text{average}(\text{Returns}(S_t, A_t))\)
\(\pi(S_t) \leftarrow \arg\max_a Q(S_t, a)\)
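The sketch below follows the same structure. The environment interface `env_step(s, a) -> (next_state, reward, done)` and the explicit `states`/`actions` lists are hypothetical assumptions used only to keep the example self-contained; exploring starts are simulated by drawing the first state-action pair uniformly at random.

```python
from collections import defaultdict
import random

# Monte Carlo ES. `env_step(s, a)` returning (next_state, reward, done) and the
# explicit state/action lists are hypothetical stand-ins for the environment.
def mc_exploring_starts(states, actions, env_step, n_episodes=10_000, gamma=0.9):
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions) for s in states}
    for _ in range(n_episodes):
        s, a = random.choice(states), random.choice(actions)   # exploring start
        episode, done = [], False
        while not done:
            s2, r, done = env_step(s, a)
            episode.append((s, a, r))
            s, a = s2, policy.get(s2, random.choice(actions))  # then follow pi
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s_t, a_t, r = episode[t]
            G = gamma * G + r
            if (s_t, a_t) not in [(x, y) for x, y, _ in episode[:t]]:  # first visit
                returns[(s_t, a_t)].append(G)
                Q[(s_t, a_t)] = sum(returns[(s_t, a_t)]) / len(returns[(s_t, a_t)])
                policy[s_t] = max(actions, key=lambda b: Q[(s_t, b)])  # greedy update
    return Q, policy
```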
On-policy First-Visit MC Control (for \(\varepsilon\)-soft policies), estimates \(\pi \approx \pi_*\)
Algorithm parameter:
Small \(\varepsilon > 0\)
Initialize:
\(\pi \leftarrow\) an arbitrary \(\varepsilon\)-soft policy
\(Q(s, a) \in \mathbb{R}\) arbitrarily, for all \(s \in S, a \in A(s)\)
\(\text{Returns}(s, a) \leftarrow\) empty list, for all \(s \in S, a \in A(s)\)
Repeat forever (for each episode):
Generate an episode following \(\pi\): \(S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T\)
\(G \leftarrow 0\)
Loop for each step of episode, \(t = T-1, T-2, \dots, 0\):
\(G \leftarrow \gamma G + R_{t+1}\)
Unless the pair \((S_t, A_t)\) appears in \(S_0, A_0, \dots, S_{t-1}, A_{t-1}\):
Append \(G\) to \(\text{Returns}(S_t, A_t)\)
\(Q(S_t, A_t) \leftarrow \text{average}(\text{Returns}(S_t, A_t))\)
\(A^* \leftarrow \arg\max_a Q(S_t, a)\) (ties broken arbitrarily)
For all \(a \in A(S_t)\):
\(\pi(a \mid S_t) \leftarrow \begin{cases}
1 - \varepsilon + \varepsilon / |A(S_t)| & \text{if } a = A^* \\
\varepsilon / |A(S_t)| & \text{if } a \ne A^*
\end{cases}\)
where \(|A(s)|\) is the number of actions available in state \(s\).
The \(\varepsilon\)-greedy policy ensures that all actions are tried, but actions with higher value estimates are tried more frequently. This balances exploration (trying new actions) and exploitation (choosing the best-known action).
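A small helper makes the \(\varepsilon\)-greedy rule above concrete; `Q` is assumed to be a dictionary keyed by `(state, action)` pairs, as in the other tabular sketches in these notes.

```python
import random

# Epsilon-greedy action selection over a tabular Q (dict keyed by (state, action)).
# With probability eps we pick uniformly at random (which may also pick the greedy
# action), so the greedy action ends up with probability 1 - eps + eps/|A(s)|.
def epsilon_greedy(Q, state, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit
```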
The simplest TD method updates \(V(S_t)\) as follows:
\[ V(S_t) \leftarrow V(S_t) + \alpha \big[R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big] \]
Tabular TD(0) for Estimating \(v_\pi\)
Input:
The policy \(\pi\) to be evaluated
Algorithm parameter:
Step size \(\alpha \in (0, 1]\)
Initialize:
\(V(s)\) arbitrarily for all \(s \in S^+\), except that \(V(\text{terminal}) = 0\)
Loop for each episode:
Initialize \(S\)
Loop for each step of episode:
\(A \leftarrow\) action given by \(\pi\) for \(S\)
Take action \(A\), observe \(R\), \(S'\)
\(V(S) \leftarrow V(S) + \alpha \left[ R + \gamma V(S') - V(S) \right]\)
\(S \leftarrow S'\)
Until \(S\) is terminal
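A Python sketch of tabular TD(0) prediction follows. The environment object with `reset() -> state` and `step(action) -> (next_state, reward, done)` is a hypothetical, Gymnasium-like interface, and `policy(state)` returns the action chosen by \(\pi\).

```python
from collections import defaultdict

# Tabular TD(0) prediction. `env` with reset()/step() and `policy(state)` are
# hypothetical interfaces assumed for the sake of a self-contained example.
def td0_prediction(env, policy, n_episodes=1_000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)                       # arbitrary init; unseen states -> 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                        # action given by pi for S
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s2])   # V(terminal) = 0
            V[s] += alpha * (target - V[s])      # TD(0) update
            s = s2
    return V
```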
We consider transitions from state–action pair to state–action pair, and learn the values of state–action pairs.
\[ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[R_{t+1}+\gamma Q(S_{t+1}, A_{t+1})-Q(S_t, A_t) \big] \]
SARSA (On-Policy TD Control), for estimating \(Q \approx q_*\)
Algorithm parameters:
Step size \(\alpha \in (0, 1]\), small \(\varepsilon > 0\)
Initialize:
\(Q(s, a)\) arbitrarily, for all \(s \in S^+\), \(a \in A(s)\)
\(Q(\text{terminal}, \cdot) = 0\)
Loop for each episode:
Initialize \(S\)
Choose \(A\) from \(S\) using a policy derived from \(Q\) (e.g., \(\varepsilon\)-greedy)
Loop for each step of episode:
Take action \(A\), observe \(R\), \(S'\)
Choose \(A'\) from \(S'\) using a policy derived from \(Q\) (e.g., \(\varepsilon\)-greedy)
\(Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]\)
\(S \leftarrow S';\quad A \leftarrow A'\)
Until \(S\) is terminal
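A sketch of SARSA with \(\varepsilon\)-greedy action selection, under the same hypothetical `env.reset()`/`env.step()` interface used in the TD(0) sketch (and assuming the same action set in every state):

```python
from collections import defaultdict
import random

# SARSA (on-policy TD control). The env interface and a shared action set
# across states are assumptions made to keep the sketch self-contained.
def sarsa(env, actions, n_episodes=5_000, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            target = r + (0.0 if done else gamma * Q[(s2, a2)])   # Q(terminal, .) = 0
            Q[(s, a)] += alpha * (target - Q[(s, a)])             # SARSA update
            s, a = s2, a2
    return Q
```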
Q-learning is defined by
\[ Q(S_t, A_t) \leftarrow Q(S_t, A_t)+ \alpha \big[R_{t+1}+\gamma \max_aQ(S_{t+1}, a)-Q(S_t, A_t) \big] \]
The Q-learning algorithm is shown below in procedural form.
Q-learning (Off-Policy TD Control), for estimating \(\pi \approx \pi_*\)
Algorithm parameters:
Step size \(\alpha \in (0, 1]\), small \(\varepsilon > 0\)
Initialize:
\(Q(s, a)\) arbitrarily for all \(s \in S^+\), \(a \in A(s)\),
except that \(Q(\text{terminal}, \cdot) = 0\)
Loop for each episode:
Initialize \(S\)
Loop for each step of episode:
Choose \(A\) from \(S\) using a policy derived from \(Q\) (e.g., \(\varepsilon\)-greedy)
Take action \(A\), observe \(R\), \(S'\)
\(Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]\)
\(S \leftarrow S'\)
Until \(S\) is terminal
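A corresponding sketch of Q-learning, under the same hypothetical environment interface; the only change relative to the SARSA sketch is the max over next actions in the update target.

```python
from collections import defaultdict
import random

# Q-learning (off-policy TD control), same assumed env/actions interface as above.
def q_learning(env, actions, n_episodes=5_000, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:                      # behaviour: eps-greedy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # greedy target
            s = s2
    return Q
```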
Model-based
Pros: a model of the environment's dynamics allows planning and tends to be sample-efficient.
Cons: requires an accurate model; model errors can translate into poor policies.
Model-free
Pros: needs no model of the dynamics; learns directly from experience (e.g. MC, SARSA, Q-learning).
Cons: typically needs many more interactions with the environment to learn a good policy.
Further resource: Steve Brunton's lecture series on reinforcement learning.