Reinforcement Learning

Unlike other machine learning methods, reinforcement learning has no classic supervision: instead of labeled examples, the agent learns from a reward signal.

Reward

The reward signal $r_{t}$ is a scalar feedback signal that indicates how well the agent is doing at time $t$.

Reinforcement Learning is based on the Reward Hypothesis, which states that all objectives can be described as maximizing the expected cumulative reward.
Training is affected by the decisions made during the reward engineering phase, where the reward function is designed.

The reward tells the agent whether a decision was good or not. The reward function is designed by a system designer, based on measurements of the agent's performance, so that the learning agent receives the feedback it needs to correct its behaviour.

Reward design raises several issues:

reward shaping (Ng et al., 1999): sometimes the reward signal must be reshaped into one more suitable for learning;
reward hacking (Skalse et al., 2022): learning agents can exploit loopholes in the reward function to achieve undesired outcomes while still collecting high rewards.
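
As a hedged illustration of reward shaping, the sketch below uses potential-based shaping, the form analyzed by Ng et al. (1999), where the term $γ Φ (s^{'}) - Φ (s)$ is added to the original reward. The one-dimensional corridor, the goal position `GOAL`, the potential `phi`, and the function names are assumptions made for this example only.

```python
# Potential-based reward shaping on a hypothetical 1-D corridor.
GAMMA = 0.9
GOAL = 5  # assumed goal position, for illustration only


def phi(state: int) -> float:
    """Assumed potential function: negative distance to the goal."""
    return -abs(GOAL - state)


def shaped_reward(reward: float, state: int, next_state: int,
                  gamma: float = GAMMA) -> float:
    """Original reward plus the shaping term gamma * phi(s') - phi(s)."""
    return reward + gamma * phi(next_state) - phi(state)


# The original reward is sparse (zero away from the goal), yet the shaped
# reward already distinguishes a step towards the goal from a step away.
step_towards = shaped_reward(0.0, state=2, next_state=3)
step_away = shaped_reward(0.0, state=2, next_state=1)
print(step_towards, step_away)
```

With this potential, moving towards the goal yields a larger shaped reward than moving away, which gives the agent a denser learning signal.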

Notation

For any integer $n \in \mathbb{N}$, we denote by $[n]$ the set $\{1, 2, \dots, n\}$. For any set $S$, $Δ (S)$ denotes the set of probability distributions over $S$.
We use $P (E)$ for the probability of some event $E$ , while $E [X]$ is used to denote the expected value of a random variable $X$ . $E_{P} [\cdot]$ or similar variations are used to emphasize that the distribution for the expected value is governed by the probability distribution $P \in Δ (S)$ .
Moreover, we will write $X \sim P$ if a random variable $X$ is distributed according to a probability distribution $P$ .

Reinforcement Learning, formally

Reinforcement learning is the setting of learning behavior from rewarded interaction with an environment (Sutton & Barto, 2018). It is formalized as a Markov decision process (MDP), a model for sequential decision making. At each step, the agent:

observes its current state;
takes an action that causes a transition to a new state;
receives a reward that depends on the effectiveness of the action.

Formally, an MDP is a tuple $(S, A, P, R, d_{0}, γ)$ where:

$S$ is a set of states (the state space);
$A$ is a set of actions (the action space);
$P : S \times A \to Δ (S)$ is a transition function (the transition dynamics);
$R : S \times A \to \mathbb{R}$ is a reward function;
$d_{0} \in Δ (S)$ is a distribution over initial states;
$γ \in [0, 1]$ is a discount factor.

The transition function $P$ defines the dynamics of the environment: for any state $s$ and action $a$, $P (s^{'} ∣ s, a)$ is the probability of reaching $s^{'}$ after executing action $a$ in state $s$.
Given the current state and action, the transition probability is conditionally independent of all previous states and actions (the Markov property).
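
The components above can be collected into a small data structure. The following is a minimal sketch for the finite case, assuming states and actions are indexed by integers; the class name `TabularMDP`, its fields, and the two-state example are hypothetical and only meant to make the definition concrete.

```python
from dataclasses import dataclass


@dataclass
class TabularMDP:
    """Illustrative container for a finite MDP (S, A, P, R, d0, gamma)."""
    n_states: int
    n_actions: int
    P: list       # P[s][a][s_next] = probability of reaching s_next
    R: list       # R[s][a] = instantaneous reward
    d0: list      # d0[s] = probability of starting in s
    gamma: float  # discount factor in [0, 1]

    def validate(self) -> bool:
        """Check that d0 and every P[s][a] are probability distributions."""
        def is_dist(p):
            return all(x >= 0 for x in p) and abs(sum(p) - 1.0) < 1e-9

        rows_ok = all(
            is_dist(self.P[s][a])
            for s in range(self.n_states)
            for a in range(self.n_actions)
        )
        return rows_ok and is_dist(self.d0) and 0.0 <= self.gamma <= 1.0


# A made-up two-state, two-action example.
mdp = TabularMDP(
    n_states=2,
    n_actions=2,
    P=[[[0.9, 0.1], [0.2, 0.8]],
       [[0.0, 1.0], [0.0, 1.0]]],
    R=[[0.0, 1.0], [0.0, 0.0]],
    d0=[1.0, 0.0],
    gamma=0.95,
)
print(mdp.validate())
```

Note that each row `P[s][a]` depends only on the current state and action, which is how the Markov property shows up in this representation.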

Instantaneous reward

The value $R (s, a) \in \mathbb{R}$ provides an immediate evaluation of performing action $a$ in state $s$, which is called the instantaneous reward¹.
When both the state space $S$ and the action space $A$ are finite, we call the MDP a tabular MDP.

Return

In an MDP, an $H$-step trajectory $τ$ is a sequence of $H \in \mathbb{N} ∖ \{0\}$ state-action pairs ending in a terminal state.
Formally, it is given by $τ = (s_{0}, a_{0}, s_{1}, a_{1}, \dots, s_{H})$. Given $t_{0} \geq 0$ and $H^{'} \leq H$, we can define a segment $σ = (s_{t_{0}}, a_{t_{0}}, s_{t_{0} + 1}, a_{t_{0} + 1}, \dots, s_{H^{'}})$, which refers to a contiguous sequence of steps within a larger trajectory.
The return $R (τ)$ of a trajectory $τ$ is the accumulated (and discounted) reward collected along this trajectory:

$$R (τ) = \sum_{h = 0}^{H - 1} γ^{h} R (s_{h}, a_{h})$$

So the return aggregates the rewards collected along a specific sequence of states and actions, each weighted by a discount factor that shrinks exponentially with time (thus reducing the importance of rewards far in the future).
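
The discounted sum can be computed directly from the list of rewards collected along a trajectory. The following is a minimal sketch; the function name `discounted_return` and the reward values are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**h * r_h over the rewards of one trajectory."""
    total = 0.0
    for h, r in enumerate(rewards):
        total += (gamma ** h) * r
    return total


# Three steps with reward 1 each and gamma = 0.5:
# 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

With `gamma=1.0` the same call returns the plain sum of rewards, while smaller values of `gamma` give less weight to later steps.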

The return is well defined even if the horizon $H$ is infinite, as long as $γ < 1$ (and rewards are bounded). If the MDP is tabular and every trajectory has finite length, i.e. $H$ is necessarily finite, we call the MDP finite; otherwise it is infinite.
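
To see why $γ < 1$ is enough in the infinite-horizon case, here is a short worked bound. It assumes the rewards are bounded, i.e. $|R (s, a)| \leq R_{\max}$ for all $s$ and $a$; the symbol $R_{\max}$ is introduced here only for this argument.

$$\left| \sum_{h = 0}^{\infty} γ^{h} R (s_{h}, a_{h}) \right| \leq \sum_{h = 0}^{\infty} γ^{h} R_{\max} = \frac{R_{\max}}{1 - γ}$$

The right-hand side is a geometric series, which converges exactly when $γ < 1$. With $γ = 1$ and an infinite horizon, the sum can diverge.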

Policy

A policy specifies how to select actions based on the state the agent is in. This can be done either:

deterministically: in this case we have a mapping $π : S \to A$ from states to actions.
stochastically: in this case we have a mapping $π : S \to Δ (A)$ from states to probability distributions over actions².
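
The two kinds of policy can be sketched as plain functions. In this example the integer states, the action names, the probabilities, and all function names are assumptions for illustration; the helper `as_stochastic` shows how a deterministic policy can be viewed as a stochastic one that puts all probability on a single action.

```python
import random

ACTIONS = ["left", "right"]  # assumed action space


def deterministic_policy(state: int) -> str:
    """Maps each state to exactly one action."""
    return "right" if state < 3 else "left"


def stochastic_policy(state: int) -> dict:
    """Maps each state to a probability distribution over actions."""
    p_right = 0.8 if state < 3 else 0.2
    return {"left": 1.0 - p_right, "right": p_right}


def sample_action(dist: dict, rng: random.Random) -> str:
    """Draw one action from a distribution given as {action: probability}."""
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def as_stochastic(policy, actions):
    """Wrap a deterministic policy as a one-hot distribution over actions."""
    def wrapped(state):
        chosen = policy(state)
        return {a: (1.0 if a == chosen else 0.0) for a in actions}
    return wrapped


rng = random.Random(0)
print(sample_action(stochastic_policy(0), rng))
print(as_stochastic(deterministic_policy, ACTIONS)(0))
```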

RL basic loop

In the basic loop, the agent chooses an action $a_{t} \sim π (s_{t})$ based on its policy and current state. As a consequence, the environment transitions into a new state $s_{t + 1} \sim P (s_{t}, a_{t})$, governed by the transition dynamics. The agent observes the new state and the reward $r_{t + 1} = R (s_{t}, a_{t})$, and the cycle starts anew.
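
The loop can be sketched in a few lines. The environment `CorridorEnv`, its `reset`/`step` interface, the slip probability, and the policy are all hypothetical, chosen only to make the loop runnable. For simplicity, the reward here is computed from the state that is reached, and it is nonzero only at the terminal state.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor with states 0..4; state 4 is terminal."""

    def __init__(self, rng: random.Random):
        self.rng = rng
        self.state = 0

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int):
        # action is -1 or +1; with probability 0.1 the move is reversed
        move = action if self.rng.random() > 0.1 else -action
        self.state = min(4, max(0, self.state + move))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def policy(state: int, rng: random.Random) -> int:
    """Assumed stochastic policy: move right with probability 0.9."""
    return 1 if rng.random() < 0.9 else -1


def run_episode(env, policy, rng, max_steps=100):
    """Agent-environment loop: act, transition, observe state and reward."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory, state


rng = random.Random(0)
trajectory, final_state = run_episode(CorridorEnv(rng), policy, rng)
print(len(trajectory), final_state)
```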

In this setting, the RL agent aims at learning a policy that maximizes the expected return:

$$J (π) = E_{d_{0}, P, π} [R (τ)]$$

where the expectation is with respect to the policy $π$, the transition function $P$, and the initial distribution $d_{0}$.
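
Since $J (π)$ is an expectation over trajectories, one simple way to approximate it is to average the returns of sampled trajectories (a Monte Carlo estimate). The sketch below assumes a made-up two-state MDP, a fixed horizon `H`, and a fixed stochastic policy; all names and numbers are illustrative.

```python
import random

# Hypothetical two-state MDP: P[s][a] is a list of (next_state, probability).
P = {
    0: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}  # R[s][a]
D0 = [(0, 1.0)]                                 # initial state distribution
GAMMA, H = 0.9, 20


def sample(pairs, rng):
    """Draw an outcome from a list of (outcome, probability) pairs."""
    outcomes, probs = zip(*pairs)
    return rng.choices(outcomes, weights=probs, k=1)[0]


def policy(state, rng):
    """Assumed policy: action 1 with probability 0.7, else action 0."""
    return 1 if rng.random() < 0.7 else 0


def sample_return(rng):
    """Sample one H-step trajectory and compute its discounted return."""
    s = sample(D0, rng)
    total = 0.0
    for h in range(H):
        a = policy(s, rng)
        total += (GAMMA ** h) * R[s][a]
        s = sample(P[s][a], rng)
    return total


def estimate_J(n_episodes, seed=0):
    """Average the returns of n_episodes sampled trajectories."""
    rng = random.Random(seed)
    return sum(sample_return(rng) for _ in range(n_episodes)) / n_episodes


print(estimate_J(1000))
```

The estimate becomes more accurate as the number of sampled trajectories grows, at the cost of more interaction with the environment.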

Families

To solve this problem there are two different families of RL approaches:

1. Model-based RL

In this family we learn a model (i.e., $P, R$ ) of the underlying MDP to solve the RL problem.
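
As one simple illustration of learning a model in the tabular case, $P$ and $R$ can be estimated from observed transitions by counting and averaging. This is only a sketch under that assumption, not a full model-based algorithm; the function name `estimate_model` and the sample data are made up.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Estimate P and R from a list of (s, a, r, s_next) tuples.

    P_hat[(s, a)][s_next] is the empirical transition frequency and
    R_hat[(s, a)] is the average observed reward.
    """
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1
    P_hat = {
        sa: {s2: c / visits[sa] for s2, c in nxt.items()}
        for sa, nxt in counts.items()
    }
    R_hat = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return P_hat, R_hat


# Made-up experience: action 1 in state 0 led to state 1 twice, state 0 once.
data = [(0, 1, 0.0, 1), (0, 1, 0.0, 1), (0, 1, 0.0, 0), (1, 0, 1.0, 1)]
P_hat, R_hat = estimate_model(data)
print(P_hat[(0, 1)], R_hat[(1, 0)])
```

The estimated model can then be used for planning in place of the unknown true dynamics.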

2. Model-free RL

In this family we try to obtain a good policy without learning an MDP model. This family can be further divided into two sub-families:

2a. Value-based methods

(e.g. DQN)
We aim at learning the $Q$-function $Q^{*}$ of an optimal policy $π^{*}$. For a policy $π$, the $Q$-function is defined by

$$Q_{π} (s, a) = E_{P, π} \left[ \sum_{h = 0}^{H - 1} γ^{h} R (s_{h}, a_{h}) \right]$$

where $s_{0} = s$ and $a_{0} = a$ and, in the expectation, $a_{h} \sim π (\cdot ∣ s_{h})$ as well as $s_{h} \sim P (\cdot ∣ s_{h - 1}, a_{h - 1})$ for $h \in [H - 1]$. A policy can be derived from a $Q$-function by choosing an action greedily in each state: $π (s) = \arg\max_{a} Q (s, a)$. Note that for a deterministic optimal policy $π^{*}$ it holds that $J (π^{*}) = E_{d_{0}} [Q^{*} (s, π^{*} (s))]$.
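
The greedy construction can be sketched for a tabular $Q$-function stored as a nested dictionary. The state and action names and the values in `Q` are made up for illustration.

```python
def greedy_policy(Q):
    """Return the policy choosing argmax_a Q(s, a) in every state.

    Q maps each state to a dict {action: value}. Ties are broken in
    favour of the action listed first.
    """
    return {s: max(q_s, key=q_s.get) for s, q_s in Q.items()}


Q = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 1.0, "right": 0.4},
}
pi = greedy_policy(Q)
print(pi)  # {'s0': 'right', 's1': 'left'}
```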
Similar to the action-value function $Q$ , we can also define the state-value function:

$$V_{π} (s) = E_{P, π} \left[ \sum_{h = 0}^{H - 1} γ^{h} R (s_{h}, a_{h}) ∣ s_{0} = s \right]$$

Its value for some state $s$ is the expected return when starting in that state and then always using the policy $π$ . It is related to the $Q$ -Function by means of

$$V_{π} (s) = E_{a \sim π (s)} [Q_{π} (s, a)]$$

for any state $s \in S$ .
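
In the tabular case this relation is a probability-weighted sum over actions. The following sketch assumes `Q` and `policy` are nested dictionaries indexed by state and action; the numbers are invented for the example.

```python
def state_value(Q, policy):
    """V(s) = sum over a of pi(a | s) * Q(s, a), for every state s."""
    return {
        s: sum(policy[s][a] * Q[s][a] for a in Q[s])
        for s in Q
    }


Q = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 2.0, "right": 4.0},
}
policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.25, "right": 0.75},
}
V = state_value(Q, policy)
print(V)  # {'s0': 0.5, 's1': 3.5}
```

For a deterministic policy the distribution is one-hot, so the sum reduces to the $Q$-value of the single chosen action.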

2b. Policy-search methods

This family aims at finding a good policy in some parametrized policy space. The most data-efficient algorithms follow an actor-critic scheme where both an actor (i.e., a policy) and a critic (i.e., a Q-value function) are learned at the same time (e.g. PPO, TD3, SAC).

In deep RL, both the value functions and the policies are approximated with neural networks.

RL algorithms can be further classified as on-policy or off-policy.
In on-policy algorithms, such as PPO, only recently generated transitions are used for training. In contrast, in off-policy algorithms, such as DQN, the agent can be updated with transitions not necessarily generated by its current policy.
While on-policy learning is usually more stable, off-policy learning is more data-efficient because it reuses samples from a replay buffer that stores past transitions.
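
A replay buffer can be sketched as a bounded queue with uniform random sampling. The class name `ReplayBuffer`, its method names, and the transition layout `(state, action, reward, next_state, done)` are assumptions for this example, not the interface of any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of past transitions with uniform sampling."""

    def __init__(self, capacity: int, seed: int = 0):
        # Once full, the oldest transition is dropped automatically.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        """Draw up to batch_size distinct stored transitions at random."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(t, 0, 0.0, t + 1, False)

batch = buffer.sample(2)
print(len(buffer), batch)
```

Because sampled transitions may come from older versions of the policy, such a buffer only fits off-policy algorithms.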

¹ In some states the instantaneous reward may be zero, with the agent receiving rewards only in specific states, such as terminal states, for which the transition function is zero.
² The deterministic case is a special case of the stochastic one, so we assume the latter.
