2025.02.06 - Lecture 2

Fuzzy Logic

Fuzzy sets are sets in which membership is not established with certainty (an element is not simply in or out); instead, the degree of membership is given by a continuous function.
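As a minimal sketch of such a continuous membership function, the snippet below uses a triangular shape. The function name `triangular_membership`, the triangular form and the "warm temperature" set with its thresholds are illustrative assumptions, not something fixed by the lecture.

```python
def triangular_membership(x: float, left: float, peak: float, right: float) -> float:
    """Degree of membership in [0, 1] for a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy set "warm", peaking at 25 degrees Celsius.
for temperature in (10.0, 20.0, 25.0, 30.0, 40.0):
    degree = triangular_membership(temperature, left=15.0, peak=25.0, right=35.0)
    print(f"{temperature:5.1f} C -> membership {degree:.2f}")
```

A crisp set would return only 0 or 1 here; the fuzzy set returns every value in between.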

Different types of Uncertainty

Uncertainty analysis in Machine Learning begins with the formal definition of the uncertain quantities involved in the modelling problem.

Aleatoric uncertainty cannot be reduced, while epistemic uncertainty can.

The goal of the mathematical modeling process is to estimate the relationship $R$ using an approximating model $M$ constructed from a dataset $D$ and a set of hypotheses $H$.

Approximating model:

M := M_{h} = M_{θ, \hat{h}} : X \to Y

Sample of inputs and associated values:

D = \{(x_{i}, y_{i})\}_{i = 1}^{n} \subset X \times Y

Set of trainable and non-trainable parameters:

h = (θ, \hat{h}) \in H

Aleatoric Inherent Uncertainty

$R$ is usually not deterministic, and the source of uncertainty is a property of the relationship/reality and thus can’t be reduced by providing more data or improving the models. The Aleatoric Inherent Uncertainty can be considered a function of both the input $x$ and the relationship $R$ :

x \to R (x) + ϵ_{A, I}, where ϵ_{A, I} := ϵ_{A, I} (x, R)

It contains randomness, and we can’t do anything about it.
Example: Consider a relationship $R$ which associates an integer $x$ with the sum of $x$ six-sided dice. $R$ is inherently random, since the same input $x$ may correspond to different outputs $y$.
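The dice example can be simulated directly. This is a small sketch under assumed settings: the input $x = 3$, the sample size and the fixed seed are arbitrary choices made only to get a reproducible run.

```python
import random
import statistics


def roll_sum(x: int, rng: random.Random) -> int:
    """One realization of R(x): the sum of x fair six-sided dice."""
    return sum(rng.randint(1, 6) for _ in range(x))


rng = random.Random(0)  # fixed seed, only for reproducibility
x = 3
samples = [roll_sum(x, rng) for _ in range(20_000)]

# The same input x produces many different outputs y.
print("distinct outputs for x = 3:", sorted(set(samples)))

# The spread stays near the theoretical value x * 35 / 12 however many
# samples we draw: more data sharpens the estimate, not the outcome.
sample_variance = statistics.pvariance(samples)
print("sample variance:", round(sample_variance, 3))
print("theoretical variance:", round(x * 35 / 12, 3))
```

Drawing more samples makes the estimated variance more precise, but the variance itself does not shrink, which is what "irreducible" means here.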

Aleatoric Experimental Uncertainty (or Noise)

It refers to the variations in information content caused by the acquisition process, such as artifacts, labelling errors, measurement errors, experimental settings, etc.

This uncertainty affects both the input and the output, making the problem almost mathematically intractable. However, it is possible to change the perspective of the problem.

The original formulation would be:

x + ϵ_{A, N, x} \to R (x) + ϵ_{A, I} (x, R) + ϵ_{A, N, y}

After the shift, the input is treated as exact and its noise is absorbed into a redefined relationship $\hat{R}$:

x + ϵ_{A, N, x} \to R (x) + ϵ_{A, I} + ϵ_{A, N, y} := \hat{R} (x) + ϵ_{A, I} (x, \hat{R}) + ϵ_{A, N, y}

E.g. we use two rulers, each with its own uncertainty. We assume the first measurement is exact, so all the error is transferred to the second measurement, which also carries the propagated error of the first.
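The change of perspective can be checked numerically. The sketch below assumes a linear relationship $R(x) = 2x$, Gaussian measurement noise and no inherent uncertainty; all of these, and the numeric values, are illustrative choices rather than part of the lecture. When the measured input is treated as exact, the error seen on the output side should match the usual error-propagation formula.

```python
import random
import statistics

rng = random.Random(1)
slope = 2.0          # hypothetical linear relationship R(x) = slope * x
sigma_x = 0.3        # standard deviation of the input measurement noise
sigma_y = 0.4        # standard deviation of the output measurement noise

residuals = []
for _ in range(50_000):
    x_true = rng.uniform(0.0, 10.0)
    x_measured = x_true + rng.gauss(0.0, sigma_x)
    y_measured = slope * x_true + rng.gauss(0.0, sigma_y)
    # Treat x_measured as exact: all the error now sits on the output side.
    residuals.append(y_measured - slope * x_measured)

observed = statistics.pvariance(residuals)
propagated = (slope * sigma_x) ** 2 + sigma_y ** 2
print("observed output-side variance:", round(observed, 3))
print("propagated variance:", round(propagated, 3))
```

The two variances agree up to sampling error, so the input noise has not disappeared; it has only been moved into the output term.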

Aleatoric Model Uncertainty

Aleatoric uncertainty can also be generated by the model $M$ during the inference process. This is due to the pseudo-random propagation of rounding errors on a machine.

Aleatoric Model Uncertainty ( $ϵ_{A, M}$ ) is often negligible compared to the other components of aleatoric uncertainty. We will consider only the inherent and the noise.
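A small sketch of where such rounding errors come from: floating-point addition is not associative, so the same numbers summed in a different order can give slightly different results. The helper `naive_sum` and the random test values are assumptions made for illustration; `math.fsum` is used as the correctly rounded reference.

```python
import math
import random


def naive_sum(values):
    """Left-to-right accumulation, rounding after every addition."""
    total = 0.0
    for value in values:
        total += value
    return total


# The grouping of the additions changes the rounding of the result.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)
print("same result?", left_first == right_first)

# The same values accumulated in two different orders.
rng = random.Random(2)
values = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]
reference = math.fsum(values)
forward = naive_sum(values)
backward = naive_sum(reversed(values))
print("forward  - reference:", forward - reference)
print("backward - reference:", backward - reference)
```

The deviations are many orders of magnitude below typical data noise, which is consistent with neglecting $ϵ_{A, M}$.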

Epistemic Uncertainty

Epistemic Uncertainty can also be called subjective, reducible or systematic.

It can be induced by the a priori choice of the hyper-parameters characterizing the model or by the finite amount of available data.

Since this uncertainty is due to a lack of knowledge, it can be reduced by improving the model hyper-parameters or by increasing the size of the dataset (i.e. it’s the reducible part of total uncertainty).

The approximated model is

M := M_{h} = M_{θ, \hat{h}} : X \to Y

$X$ and $Y$ represent the whole input and output spaces. Here we assume that the model is trained with infinite data.

We focus on the model’s parameters: $θ$ are the trainable parameters (weights and biases), and $\hat{h}$ are the non-trainable parameters, such as the type of model, the model’s geometry, the loss function, and other hyper-parameters.

Once the hyper-parameters $\hat{h}$ are fixed, we have to choose the optimal $M$, which is fully characterized by its trainable parameters $θ$.

We want to find $θ^{*}$, the value that minimizes the expected loss. Formalizing the problem:

θ^{*} := \arg\min_{θ} E_{l} (θ, \hat{h})

E_{l} (θ, \hat{h}) := \int_{X \times Y} l (M_{θ, \hat{h}} (x), y) \, p (x, y) \, d x \, d y

We obtain the Optimal Bayesian Predictor:

M_{θ^{*}, \hat{h}}

with fixed, established hyper-parameters and optimal trainable parameters.

Epistemic Model Uncertainty

Even with optimal trainable parameters, and even assuming null aleatoric uncertainty (inherent + noise) $ϵ_{A}$, there can be a discrepancy between reality and the model outcome: $R (x) \neq M_{θ^{*}, \hat{h}} (x)$.
The discrepancy between the Optimal Bayesian Predictor and reality is called Epistemic Model Uncertainty ( $ϵ_{E, M}$ ):

M_{θ^{*}, \hat{h}} (x) = R (x) + ϵ_{E, M}

Even with an infinite amount of data, once you project some information onto the problem (you choose a structure, you choose a loss function, etc.), you will **always approximate the reality**; you will never reach it. Thus $ϵ_{E, M}$ can be reduced but not eliminated.

So the Optimal Bayesian Predictor is the best approximation of reality we can get.
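To make the model uncertainty concrete, here is a sketch in which the choices are deliberately mismatched: the reality $R(x) = x^2$, the straight-line model family and the dense grid standing in for "infinite data" are all illustrative assumptions. There is no noise at all, yet the best line still misses the parabola.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Dense, noise-free grid standing in for "infinite data" on X = [-1, 1].
xs = [-1.0 + 2.0 * i / 10_000 for i in range(10_001)]
ys = [x ** 2 for x in xs]          # hypothetical reality R(x) = x^2

intercept, slope = fit_line(xs, ys)
mse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

print("best line:", round(intercept, 4), "+", round(slope, 4), "* x")
print("remaining mean squared error:", round(mse, 4))
```

The remaining error comes purely from the choice of model family, so adding data cannot remove it; only changing $\hat{h}$ (for instance allowing a quadratic term) can.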

Epistemic Approximation Uncertainty

Finding the optimal approximating model $M$ requires full knowledge of the space $X \times Y$, which is usually not available to the experimenter, so we have to rely on a dataset. Information about the space $X \times Y$ is provided by a sample $D$, the training dataset:

D = \{(x_{i}, y_{i})\}_{i = 1}^{n} \subset X \times Y

Once the hyper-parameters $\hat{h}$ are established, the goal is to infer the optimal trainable parameters $\hat{θ}$ from the dataset $D$:

\hat{θ} := \arg\min_{θ} \hat{E}_{l} (θ, \hat{h}) \approx θ^{*}

$\hat{θ}$ is an approximation of the optimal trainable parameters $θ^{*}$, obtained through the minimization of the empirical loss/risk $\hat{E}_{l}$ on $D$.

The discrepancy between $\hat{θ}$ and $θ^{*}$ depends on the quality and the number $n$ of data available and is called the Epistemic Approximation Uncertainty ( $ϵ_{E, A_{p}}$ ):

M_{\hat{θ}, \hat{h}} (x) = M_{θ^{*}, \hat{h}} (x) + ϵ_{E, A_{p}}

We obtain an empirical model

M_{\hat{θ}, \hat{h}}

where the trainable parameters are the best ones obtainable on the dataset $D$.

The Epistemic Approximation Uncertainty is also called interpolation uncertainty and can be reduced by improving the quality and size of the dataset.
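The dependence on the dataset size can be illustrated with a one-parameter model. In this sketch the reality $y = 2x + \text{noise}$, the model $M_{θ}(x) = θx$, the squared loss and the helper names are all assumptions chosen so that $θ^{*} = 2$ is known in advance and $\hat{θ}$ has a closed form.

```python
import random


def fit_slope(data):
    """Empirical risk minimizer for M_theta(x) = theta * x under squared loss."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


def sample_dataset(n, rng, theta_star=2.0, noise_std=1.0):
    """Draw n pairs from a hypothetical reality y = theta_star * x + noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, theta_star * x + rng.gauss(0.0, noise_std)))
    return data


rng = random.Random(3)
theta_star = 2.0
mean_gap = {}
for n in (10, 100, 1000):
    # Average over repeated datasets so the trend is not a single lucky draw.
    gaps = [abs(fit_slope(sample_dataset(n, rng)) - theta_star) for _ in range(200)]
    mean_gap[n] = sum(gaps) / len(gaps)
    print(f"n = {n:4d}: mean |theta_hat - theta_star| = {mean_gap[n]:.3f}")
```

The average gap between $\hat{θ}$ and $θ^{*}$ shrinks as $n$ grows, which is the reducible behaviour described above.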

Balancing Epistemic Uncertainty

The empirical distribution $p (D)$ obtained from $D$ and used to train the model imperfectly mimics the real underlying distribution on $X \times Y$ (i.e. $p (D) \approx p (X \times Y)$).

Increasing the dataset size ( $n \to \infty$ ) improves the approximation, reducing the Epistemic Approximation Uncertainty ( $ϵ_{E, A_{p}} \to 0$ ), but cannot reduce the Epistemic Model Uncertainty.

Increasing the complexity of the model (increasing the number of trainable parameters $θ$ ) improves the model’s ability to approximate a deterministic relationship $R$ .

For ANNs, the Universal Approximation Theorem ensures that a sufficiently complex neural network can approximate any continuous relationship $R$ arbitrarily well ( $ϵ_{E, M} \to 0$ ).

However, a more complex model requires a huge amount of data to be trained on; with a fixed dataset this results in a dramatic increase in Epistemic Approximation Uncertainty ( $ϵ_{E, A_{p}} \to \infty$ ).
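The trade-off can be sketched with a deliberately simple model family instead of a neural network. Everything here is an illustrative assumption: the reality $R(x) = \sin(2πx)$, a piecewise-constant model whose complexity is the number of bins, a fixed training set of 20 noisy points, and the fallback to the global mean for empty bins.

```python
import math
import random


def reality(x):
    """Hypothetical deterministic relationship R on X = [0, 1)."""
    return math.sin(2.0 * math.pi * x)


def fit_bins(data, n_bins):
    """Piecewise-constant model: one trainable value (the bin mean) per bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in data:
        index = min(int(x * n_bins), n_bins - 1)
        sums[index] += y
        counts[index] += 1
    global_mean = sum(y for _, y in data) / len(data)
    # Bins with no training point fall back to the global mean.
    return [s / c if c else global_mean for s, c in zip(sums, counts)]


def error_vs_reality(levels, n_bins, grid):
    """Mean squared distance from the noise-free reality on a dense grid."""
    total = 0.0
    for x in grid:
        prediction = levels[min(int(x * n_bins), n_bins - 1)]
        total += (prediction - reality(x)) ** 2
    return total / len(grid)


def sample_training_set(n, noise_std, rng):
    """Draw n noisy observations of the reality."""
    xs = [rng.random() for _ in range(n)]
    return [(x, reality(x) + rng.gauss(0.0, noise_std)) for x in xs]


rng = random.Random(4)
grid = [(i + 0.5) / 400 for i in range(400)]
n_train, noise_std, trials = 20, 0.3, 200

average_error = {}
for n_bins in (1, 4, 64):
    errors = []
    for _ in range(trials):
        data = sample_training_set(n_train, noise_std, rng)
        errors.append(error_vs_reality(fit_bins(data, n_bins), n_bins, grid))
    average_error[n_bins] = sum(errors) / trials
    print(f"{n_bins:3d} bins: average error vs reality = {average_error[n_bins]:.3f}")
```

With one bin the model is too rigid to follow $R$ (model uncertainty dominates); with 64 bins and only 20 points most parameters are barely constrained by data (approximation uncertainty dominates); the intermediate setting gives the lowest average error in this setup.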

Summary of the uncertainties

y = R (x) + ϵ_{A, I} + ϵ_{A, N}

Here $R$ is the representation of reality that we would like to reach. It is always affected by the inherent aleatoric uncertainty and the noise. These are non-reducible.

M_{θ^{*}, \hat{h}} (x) \approx y + ϵ_{E, M}

Here we have the epistemic model uncertainty, which is related to the choice of the model made by fixing hyper-parameters, geometry, etc. However, with an infinite amount of data we find the Optimal Bayesian Predictor, which has the best set of trainable parameters.

M_{\hat{θ}, \hat{h}} (x) \approx M_{θ^{*}, \hat{h}} (x) + ϵ_{E, A_{p}}

And here we also have the “problem” of the amount of data, because we can only train our model on a sample drawn from reality (Epistemic Approximation Uncertainty).

The Optimal Bayesian Predictor is the best approximation of $R$ on the whole space.
The empirical model is the best approximation of that predictor on the dataset $D$.
The full relationship between the actual model and reality can be summarized as:

M_{\hat{θ}, \hat{h}} (x) \approx R (x) + ϵ_{A, I} + ϵ_{A, N} + ϵ_{E, M} + ϵ_{E, A_{p}} := R (x) + ϵ_{A} + ϵ_{E}

This can be summarized as the sum of the aleatoric uncertainty and the epistemic uncertainty (predictive posterior uncertainty). However, uncertainties are not purely additive terms: they must be considered as nonlinear functions, so they also have a multiplicative component. Our aim is to reduce uncertainty.

Techniques to handle Uncertainty

They are divided into:

Intrusive (By design);
- complex implementation;
- computationally expensive;
- effective;
Semi-Intrusive;
- effective and easy to implement;
- computationally expensive;
Non-Intrusive (Post-hoc);
- easy implementation;
- computationally less expensive;
- inferior performance.

📚 Michele's Notes
