Classical deep architectures can be viewed as Bayesian networks / described with Bayesian formalism. Internal scores (logits) should not be used as Bayesian estimates.

From Classical Probability to Bayesian Theory

The probability of a given event is the measure of that event relative to all possible events.
A probability space is a triple $(\Omega, \mathcal{F}, P)$:
$\Omega$ is the sample space (all that can potentially happen)
$\mathcal{F}$ is the sigma algebra (tribe) of events
$P$ is the probability measure, which satisfies:

  • $P(A) \geq 0$ for every event $A$, $P(\Omega) = 1$, and $P$ is countably additive over disjoint events.

$\mathcal{F}$ is the space of all events we can assign a probability to.
Events are all the combinations of outcomes you’re interested in.

Probability Mass Function

It provides the description of the probability distribution we are interested in.
The probability mass function of a fair die is:
$p(x) = \frac{1}{6}$ for every face $x \in \{1, 2, 3, 4, 5, 6\}$.
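
A minimal Python sketch (not from the notes) of this PMF as a dictionary, checking that it is a valid probability distribution:

```python
# PMF of a fair six-sided die: every face has probability 1/6.
pmf = {face: 1 / 6 for face in range(1, 7)}

# A valid PMF is non-negative and sums to 1 over all outcomes.
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# Probability of an event = sum of the PMF over the outcomes it contains.
p_even = sum(pmf[face] for face in (2, 4, 6))
print(p_even)  # 0.5
```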

From Discrete to Continuous Distributions

In the continuous case we have to define a probability density function:
$P(a \leq X \leq b) = \int_a^b f(x)\,dx$
The first part is the interval, the second part is the density. Some famous density functions are the uniform, the log-normal, and the Gaussian.
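
A small Python sketch (assuming numpy and, for illustration, a standard Gaussian density) of computing $P(a \leq X \leq b)$ by numerically approximating the integral of the density over the interval:

```python
import numpy as np

# Standard Gaussian density f(x); P(a <= X <= b) is the integral of f over [a, b].
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

a, b = -1.0, 1.0
xs = np.linspace(a, b, 100_000)
dx = xs[1] - xs[0]
prob = np.sum(f(xs)) * dx  # crude numerical approximation of the integral
print(prob)  # ~0.6827, the familiar "one sigma" probability
```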

Shannon Entropy

Entropy is a measure of the amount of information (equivalently, of uncertainty). You can use Shannon entropy, via the maximum-entropy principle, to find a family of distributions for your data.

Definition. Given a discrete random variable $X$, with possible outcomes $x_1, \dots, x_n$ occurring with probabilities $p(x_1), \dots, p(x_n)$, the Shannon entropy of $X$ is defined as:

$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$

Given a random variable $X$ with probability density function $f$ whose support is $\mathcal{X}$, the differential entropy (or continuous entropy) is defined as:

$h(X) = -\int_{\mathcal{X}} f(x) \log f(x)\,dx$

where the base of the logarithm can be chosen freely. Common choices include $2$ (bits) or $e$ (nats).
A maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions.
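
A minimal Python sketch of the discrete definition, comparing a fair die (uniform PMF) with a loaded one; the uniform distribution attains the maximum entropy for six outcomes:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum p log p (terms with p = 0 contribute 0)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair   = [1 / 6] * 6                      # uniform PMF over the six faces
loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]   # a biased die

print(entropy(fair))    # log2(6) ~ 2.585 bits: the maximum for 6 outcomes
print(entropy(loaded))  # ~ 2.161 bits: less uncertainty, lower entropy
```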

Density in $\mathbb{R}^n$

We can look at the shadows (the marginals) on each dimension. Each dimension is a random variable:

$f_{X_i}(x_i) = \int f(x_1, \dots, x_n)\, \prod_{j \neq i} dx_j$

This doesn’t let you study covariance. You would need to study every dimension simultaneously, so we take a strong assumption: that the directions are independent, i.e. $f(x_1, \dots, x_n) = \prod_i f_{X_i}(x_i)$.
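
A small numpy sketch of why the shadows are not enough: two dimensions with identical marginals can still be strongly dependent (the covariance value below is chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated Gaussian dimensions: identical marginals, strong covariance.
cov = [[1.0, 0.9],
       [0.9, 1.0]]
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100_000)

# Each "shadow" (marginal) looks like a plain standard Gaussian...
print(samples.mean(axis=0), samples.std(axis=0))   # ~[0, 0], ~[1, 1]

# ...but only the joint view reveals the dependence between the dimensions.
print(np.cov(samples, rowvar=False))               # off-diagonal ~0.9
```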

Central Limit Theorem

Informally: you can make almost anything approach a normal distribution by averaging many independent samples.
Every time you take many variables and average or sum them, you approach a Gaussian distribution. One die has a uniform distribution; averaging two already gives a bell-like (triangular) shape, and with more dice the average approaches a Gaussian.

Suppose $X_1, X_2, \dots$ is a sequence of i.i.d. random variables with $\mathbb{E}[X_i] = \mu$ and $\operatorname{Var}[X_i] = \sigma^2 < \infty$. Then, as $n$ approaches infinity, the random variables $\sqrt{n}\,(\bar{X}_n - \mu)$ converge in distribution to a normal $\mathcal{N}(0, \sigma^2)$.
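
A Python sketch of the dice intuition: averaging more and more dice keeps the mean around 3.5, shrinks the spread like $\sigma/\sqrt{n}$, and makes the distribution of the average increasingly bell-shaped:

```python
import numpy as np

rng = np.random.default_rng(0)

# One die is uniform over {1,...,6}; the average of n dice tends to a Gaussian.
for n in (1, 2, 30):
    means = rng.integers(1, 7, size=(100_000, n)).mean(axis=1)
    print(n, means.mean(), means.std())
    # The mean stays ~3.5 while the spread shrinks like sigma/sqrt(n);
    # a histogram of `means` becomes increasingly bell-shaped as n grows.
```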

Random Variable

The space of possible outcomes and events can be too complex, sometimes intractable.

A random variable is a measurable function which moves the original setting to an easier one, while maintaining some properties of the original space.
e.g. $X \colon \Omega \to \mathbb{R}$

e.g. If I get 7 by summing two dice I win, otherwise I lose. I don’t care about the space containing all the possible outcomes, I just want to work with two possibilities: I win or I lose.
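
A sketch of this in Python, enumerating the 36 outcomes of the two dice and pushing them through the win/lose random variable:

```python
from itertools import product

# Original space: all 36 ordered outcomes of two dice.
omega = list(product(range(1, 7), repeat=2))

# Random variable: map each outcome into the simpler space {"win", "lose"}.
X = lambda outcome: "win" if sum(outcome) == 7 else "lose"

p_win = sum(1 for o in omega if X(o) == "win") / len(omega)
print(p_win)  # 6/36 ~ 0.1667
```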

Conditional probability

The probability of two events happening at the same time is the probability of B times the probability of A given that B happened: $P(A \cap B) = P(A \mid B)\,P(B)$.

What is the probability of getting a 2 on a D6, knowing that we got an even number?
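
Working it out with the definition of conditional probability:
$P(\{2\} \mid \text{even}) = \dfrac{P(\{2\} \cap \text{even})}{P(\text{even})} = \dfrac{1/6}{1/2} = \dfrac{1}{3}$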

Conditional probability also works for continuous random variables.

If A and B are independent then $P(A \mid B) = P(A)$. E.g. what is the probability of getting a 2 on a D6 knowing that the die is brown? The colour is irrelevant, so it is still $1/6$.

Bayes’ Formula

$P(H \mid D) = \dfrac{P(D \mid H)\,P(H)}{P(D)}$: the posterior is $P(H \mid D)$, $P(D \mid H)$ is the likelihood, $P(H)$ is the prior, and $P(D)$ is the evidence.

Likelihood: the plausibility of a given outcome when my hypothesis is true.
The prior is what we believe; the likelihood measures how well that belief explains the observed data.

Likelihood is the probability of observing the data given the current weights.
Likelihood and the loss function are strongly related: we want the model to maximize the likelihood, and we do so by minimizing a function of the likelihood (typically the negative log-likelihood) called the loss function.
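
A toy Python sketch of this relation for classification (the probabilities and labels below are made up): the mean negative log of the probability assigned to the true class is exactly the cross-entropy loss, so maximizing the likelihood is the same as minimizing that loss:

```python
import numpy as np

# Toy classifier outputs (probabilities) and true class indices for 3 samples.
probs   = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.3, 0.5]])
targets = np.array([0, 1, 2])

# Likelihood of the data under the model = product of per-sample probabilities
# (i.i.d. assumption); the loss is its negative log, averaged over samples.
likelihood = np.prod(probs[np.arange(3), targets])
nll = -np.log(probs[np.arange(3), targets]).mean()   # = cross-entropy loss

print(likelihood)  # maximizing this ...
print(nll)         # ... is the same as minimizing this
```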

The prior in a NN setting is (a distribution over) the starting weights.
The posterior is (a distribution over) the final weights given the input data.

The likelihood asks: how plausible are my weights given the data? $L(w; D) = p(D \mid w)$.
The second part is a density function: the density of the outputs under the model.
By definition the likelihood of a given model, given new data, is the “probability” of obtaining those data from the weights of the model.

We assume that each data point is independent and identically distributed, so the likelihood factorizes: $p(D \mid w) = \prod_{i=1}^{N} p(d_i \mid w)$.

If you have a good prior, Bayesian learning works very well; otherwise it works poorly and you need a lot of data to correct it.
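
A grid-based Python sketch of Bayes' rule on coin flips (the priors and the data proportions are chosen purely for illustration), showing that a bad prior needs far more data to be corrected:

```python
import numpy as np

# theta = probability of heads, evaluated on a grid; posterior ∝ likelihood × prior.
theta = np.linspace(0.001, 0.999, 999)

def posterior(prior, heads, tails):
    likelihood = theta**heads * (1 - theta)**tails   # i.i.d. Bernoulli likelihood
    post = likelihood * prior
    return post / post.sum()                         # normalize (the evidence)

good_prior = np.exp(-0.5 * ((theta - 0.7) / 0.1) ** 2)  # belief centred near the truth (0.7)
bad_prior  = np.exp(-0.5 * ((theta - 0.2) / 0.1) ** 2)  # belief centred far from the truth

for heads, tails in [(7, 3), (700, 300)]:
    print(heads + tails,
          theta[posterior(good_prior, heads, tails).argmax()],
          theta[posterior(bad_prior, heads, tails).argmax()])
# With 10 flips the bad prior still drags the estimate away from 0.7;
# with 1000 flips the data overwhelm it and both estimates end up near 0.7.
```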

Predictive Posterior Distribution

We are modeling $p(y \mid x, D)$, not the posterior $p(w \mid D)$ over the weights given the dataset. In inference we plug $x$ into the network and we get $p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw$: the distribution of the predictive posterior.
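
A Monte-Carlo sketch in Python for a made-up one-weight linear model, where the posterior over the weight is simply assumed Gaussian for illustration: the predictive posterior is approximated by averaging predictions over weights sampled from the posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: y = w * x, with an assumed Gaussian posterior over w.
x_new = 2.0
w_samples = rng.normal(loc=1.5, scale=0.2, size=5_000)   # stand-in for p(w | D)

# One prediction per sampled weight approximates p(y | x, D) = ∫ p(y | x, w) p(w | D) dw.
predictions = w_samples * x_new
print(predictions.mean(), predictions.std())
# The spread of `predictions` reflects weight uncertainty, not just data noise.
```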

Artificial Neural Network

Anatomy of a feed-forward ANN: each layer computes an affine map followed by a non-linearity, $h^{(l+1)} = f\left(W^{(l)} h^{(l)} + b^{(l)}\right)$.

We need a non-linear $f$ because without it the network is just matrix multiplications and additions, which could be collapsed into a single linear layer.
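
A quick numpy check of this claim (shapes picked arbitrarily): two stacked linear layers are exactly one collapsed linear layer, while inserting a ReLU breaks the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Two purely linear layers...
two_linear_layers = W2 @ (W1 @ x)
# ...equal a single layer with the collapsed matrix W2 @ W1.
one_collapsed_layer = (W2 @ W1) @ x
print(np.allclose(two_linear_layers, one_collapsed_layer))  # True

# With a non-linearity in between, the collapse no longer holds.
relu = lambda z: np.maximum(z, 0)
print(np.allclose(W2 @ relu(W1 @ x), one_collapsed_layer))  # False (in general)
```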

The bias is optional.
It acts as a threshold below which the input to the next layer doesn’t change much; it helps stabilize training.

Recipe of ANN

  • A set of hyper-parameters and trainable parameters.
  • Some non-linearity.
  • An objective function (cost/loss).
  • An optimization method.
  • Sufficient data.

The family of neural networks is dense in function space: a large enough network can approximate any continuous function.

Feed forward networks with non-polynomial activation functions are dense in the space of continuous functions between two Euclidean spaces, w.r.t. the compact convergence topology.

Training ANNs in supervised learning

We need a dataset, and we need to define a measure which compares the quality of the predictions against the targets, e.g. MSE.
We also need a way to teach the model how to update its weights. We use the gradient, i.e. the direction of steepest change of the objective function, and update the weights against it.
Computing the derivative with respect to every weight naively is too expensive, so we use the back-propagation algorithm.
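
A self-contained Python sketch (numpy only, toy sine-regression data, layer sizes and learning rate picked arbitrarily) of the whole recipe: forward pass, MSE loss, manual back-propagation, and a gradient-descent update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression: fit y = sin(x) with a one-hidden-layer MLP.
X = np.linspace(-3, 3, 200).reshape(-1, 1)
Y = np.sin(X)

W1, b1 = rng.normal(scale=0.5, size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(16, 1)), np.zeros(1)
lr = 0.05

for step in range(2000):
    # Forward pass: linear -> tanh -> linear.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    loss = np.mean((pred - Y) ** 2)                 # MSE objective

    # Backward pass (chain rule = back-propagation).
    d_pred = 2 * (pred - Y) / len(X)                # dLoss/dPred
    dW2, db2 = h.T @ d_pred, d_pred.sum(axis=0)
    d_h = d_pred @ W2.T * (1 - h ** 2)              # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)

    # Gradient descent: move the weights against the gradient of the loss.
    W1, b1 = W1 - lr * dW1, b1 - lr * db1
    W2, b2 = W2 - lr * dW2, b2 - lr * db2

print(loss)  # should be far below the loss at step 0
```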

Known Losses

RMSE for regression works really well.
Cross-entropy for classification.

The loss is the key ingredient of a NN.

Reconstruction
We use RMSE.
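
A minimal Python sketch of the two losses (function names are mine):

```python
import numpy as np

def rmse(pred, target):
    """Root mean squared error, for regression / reconstruction."""
    return np.sqrt(np.mean((pred - target) ** 2))

def cross_entropy(probs, labels, eps=1e-12):
    """Cross-entropy for classification: mean of -log p(true class)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

print(rmse(np.array([1.0, 2.0]), np.array([1.5, 1.5])))       # 0.5
print(cross_entropy(np.array([[0.9, 0.1], [0.2, 0.8]]),
                    np.array([0, 1])))                          # ~0.164
```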