Classical deep architectures can be seen as Bayesian networks, i.e. described using the Bayesian formalism. The internal scores (logits) are not to be used as Bayesian estimates.
From Classical Probability to Bayesian Theory
The probability of a given event is the probability of that event with respect to all possible events.
A probability space is a triple $(\Omega, \mathcal{F}, P)$:
- $\Omega$ is the sample space (all that can potentially happen)
- $\mathcal{F}$ is the sigma-algebra (Tribù) of events
- $P$ is the probability measure, which satisfies:
  - $P(A) \ge 0$ for every event $A$, $P(\Omega) = 1$, and $P(\bigcup_i A_i) = \sum_i P(A_i)$ for disjoint events $A_i$
$\mathcal{F}$ is the space of all events; events are all the combinations of outcomes you are interested in.
Probability Mass Function
It provides the description of the probability distribution we are interested in: for a discrete random variable $X$, $p_X(x) = P(X = x)$, with $\sum_x p_X(x) = 1$.
From Discrete to Continuous Distributions
For a continuous random variable, the probability of landing in an interval is the integral of the density over that interval: $P(a \le X \le b) = \int_a^b f_X(x)\,dx$. The first part is the interval (the event), the second part is the density.
Shannon Entropy
Entropy is a measure of the amount of uncertainty in a distribution (equivalently, of the information gained when the outcome is revealed): $H(X) = -\sum_x p(x)\log p(x)$. You can use the Shannon entropy to find a suitable family of distributions for your data (e.g. via the maximum-entropy principle).
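As a quick illustration (my own sketch, not from the lecture), a minimal Python snippet computing the Shannon entropy of a discrete distribution; the example distributions are made up:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log2 p(x), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

# A fair die (uniform) is maximally uncertain; a loaded die less so.
print(shannon_entropy([1/6] * 6))                            # ~2.585 bits
print(shannon_entropy([0.5, 0.3, 0.1, 0.05, 0.03, 0.02]))    # lower entropy
```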
Density in $\mathbb{R}^n$
We can look at the shadows (the marginal densities) on each dimension.
This doesn’t let you study covariance. You would need to study all dimensions simultaneously, so we take a strong assumption: the directions are independent, and the joint density factorizes into the product of the one-dimensional marginals, $f(x_1,\dots,x_n) = \prod_i f_i(x_i)$.
Central Limit Theorem
You can make almost anything look like a normal distribution by sampling a lot and averaging.
Every time you take multiple independent variables and average them, the result tends towards a Gaussian distribution. One die has a uniform distribution; the average of two dice is already bell-ish (triangular), and averaging more and more dice approaches a Gaussian.
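A minimal simulation of this dice example (my own sketch, not from the notes): as the number of averaged dice grows, the mean stays at 3.5, the spread shrinks, and the histogram becomes Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

def dice_means(n_dice, n_samples=100_000):
    """Average of n_dice fair six-sided dice, sampled n_samples times."""
    rolls = rng.integers(1, 7, size=(n_samples, n_dice))
    return rolls.mean(axis=1)

for n in (1, 2, 10, 100):
    m = dice_means(n)
    # Mean stays at 3.5, spread shrinks as 1/sqrt(n), shape approaches a Gaussian.
    print(f"n={n:3d}  mean={m.mean():.3f}  std={m.std():.3f}")
```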
Random Variable
The space of possibilities and events can be too complex, sometimes intractable.
A random variable is a measurable function (e.g. $X : \Omega \to \mathbb{R}$) which moves the original setting to an easier one, while maintaining some properties of the original space.
e.g. If I get 7 by summing two dice I win, otherwise I lose. I don’t care about the space containing all 36 outcomes, I just want to work with binary possibilities: I win or I lose.
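A small sketch (names and structure are mine) of that random variable: it maps the full 36-outcome space of two dice onto the much simpler win/lose space.

```python
from itertools import product

# Original sample space: all 36 ordered outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))

# Random variable X: Omega -> {1, 0}; 1 if the sum is 7 (win), else 0 (lose).
def X(outcome):
    return 1 if sum(outcome) == 7 else 0

# In the simpler space, the induced distribution is just two numbers.
p_win = sum(X(o) for o in omega) / len(omega)
print(p_win)          # 6/36 ≈ 0.167
print(1 - p_win)      # probability of losing
```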
Bayesian Theory
$P(A \cap B)$ is the probability of the two events happening at the same time (the joint probability).
What is the probability of getting a 2 on a D6, knowing that we got an even number?
Conditional probability
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$
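Worked out for the D6 question above (a standard computation, written out here for clarity):

$$P(X{=}2 \mid X \text{ even}) = \frac{P(\{X{=}2\} \cap \{X \text{ even}\})}{P(X \text{ even})} = \frac{1/6}{1/2} = \frac{1}{3}$$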
Bayes' theorem also works for continuous random variables.
Bayes' Formula
$P(H \mid D) = \dfrac{P(D \mid H)\, P(H)}{P(D)}$: the posterior is $P(H \mid D)$, $P(D \mid H)$ is the likelihood, $P(H)$ is the prior, and $P(D)$ is the evidence.
Likelihood: the plausibility of a given outcome when my hypothesis is true.
The prior is what we believe; the likelihood measures how right we are, i.e. how well the data agree with that belief.
The likelihood is the probability of observing the data given the current weights, $p(D \mid w)$.
Likelihood and loss function are strongly related, almost the same thing: we want the model to maximize the likelihood, which we do by minimizing a function of the likelihood (typically the negative log-likelihood), called the loss function.
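A minimal sketch (my own, with made-up values) of that relation for a classifier: minimizing the negative log-likelihood of the correct labels is the same as maximizing the likelihood, and for one-hot labels it coincides with the cross-entropy loss.

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """probs: (N, C) predicted class probabilities; labels: (N,) true class ids.
    Loss = -(1/N) * sum_i log p(y_i | x_i, w), i.e. the average per-sample NLL."""
    p_true = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(p_true))

# Higher likelihood of the correct classes -> lower loss.
probs_good = np.array([[0.9, 0.1], [0.2, 0.8]])
probs_bad  = np.array([[0.5, 0.5], [0.6, 0.4]])
labels = np.array([0, 1])
print(negative_log_likelihood(probs_good, labels))  # small
print(negative_log_likelihood(probs_bad, labels))   # larger
```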
The prior in an NN setting is the (distribution over the) starting weights.
The posterior is the (distribution over the) final weights given the training data.
$p(w \mid D) = \dfrac{p(D \mid w)\, p(w)}{p(D)}$
The likelihood asks: how plausible are my weights given the data? Formally it is $p(D \mid w)$, read as a function of the weights.
The second part is the density function, i.e. the density of the output ($y$ given $x$ and the weights).
By definition, the likelihood of a given model, given new data, is the “probability” of obtaining those data from the model's weights.
We assume that each data point is independent and identically distributed (i.i.d.), so the likelihood can be rewritten as a product over the samples: $p(D \mid w) = \prod_i p(y_i \mid x_i, w)$.
If you have a good prior, Bayesian learning works very well; otherwise it works poorly and you need a lot of data to correct the prior.
Predictive Posterior Distribution
We are modeling $p(y^* \mid x^*, D)$, not the posterior $p(w \mid D)$, which is the distribution of the weights given the dataset. At inference we plug $x^*$ into the network and get $p(y^* \mid x^*, D) = \int p(y^* \mid x^*, w)\, p(w \mid D)\, dw$, the predictive posterior distribution over the outputs.
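A hedged sketch of how this is usually approximated in practice: the integral over weights is replaced by a Monte Carlo average of the network's predictions over weight samples drawn (approximately) from $p(w \mid D)$. `sample_posterior_weights` and `network` are placeholders, not functions from the lecture.

```python
import numpy as np

def predictive_posterior(x_star, sample_posterior_weights, network, n_samples=100):
    """Monte Carlo approximation of p(y* | x*, D):
       p(y* | x*, D) = ∫ p(y* | x*, w) p(w | D) dw  ≈  (1/S) Σ_s p(y* | x*, w_s).
    sample_posterior_weights() must return one weight sample w_s ~ p(w | D)
    (e.g. from MC dropout, an ensemble member, or a variational posterior)."""
    preds = [network(x_star, sample_posterior_weights()) for _ in range(n_samples)]
    return np.mean(preds, axis=0)   # averaged predictive distribution
```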
Artificial Neural Network
[Anatomy of a feed-forward ANN: figure to be copied here.]
We need a non-linear $f$ because without it the network is just matrix multiplications and additions, which could be collapsed into a single linear layer.
The bias is optional.
It acts as a threshold below which the next layer does not change much; it helps stabilize training.
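A minimal sketch (assumed shapes and names) of one feed-forward layer with the non-linearity and the optional bias described above:

```python
import numpy as np

def dense_layer(x, W, b=None, activation=np.tanh):
    """One feed-forward layer: y = f(W x + b).
    Without the non-linearity f, stacking such layers stays linear
    and collapses into a single matrix multiplication."""
    z = W @ x
    if b is not None:       # the bias is optional; it shifts the threshold of f
        z = z + b
    return activation(z)

x = np.array([0.5, -1.2, 3.0])
W = np.random.randn(4, 3) * 0.1
h = dense_layer(x, W, b=np.zeros(4))
print(h.shape)  # (4,)
```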
Recipe of an ANN
- A set of hyper-parameters and trainable parameters.
- Some non-linearity.
- An objective function (cost/loss)
- An optimization method
- Sufficient data
The family of neural networks is dense in the function space: a neural network can approximate any (continuous) function.
Feed forward networks with non-polynomial activation functions are dense in the space of continuous functions between two Euclidean spaces, w.r.t. the compact convergence topology.
Training ANNs in supervised learning
We need a dataset, and we need to define a measure that compares the quality of the predictions against the targets, e.g. MSE.
We also need a way to teach the model how to update its weights. We use the gradient, which gives the direction of steepest change of the objective function; we update the weights against it to decrease the loss.
Computing the derivatives naively is usually too expensive, so we use the back-propagation algorithm, which computes all gradients efficiently via the chain rule.
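A minimal training sketch (my own, with made-up data) of this recipe: MSE loss, the gradient computed analytically for a linear model, and gradient-descent updates. Real networks obtain the same gradients layer by layer via back-propagation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # toy dataset
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(3)                                # trainable parameters
lr = 0.1                                       # hyper-parameter (learning rate)

for step in range(200):
    pred = X @ w                               # forward pass (linear model)
    grad = 2 * X.T @ (pred - y) / len(y)       # d(MSE)/dw
    w -= lr * grad                             # move against the gradient
print(w)                                       # ≈ [1.0, -2.0, 0.5]
```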
Known Losses
- RMSE for regression works really well
- Cross-entropy for classification (to look up and understand this better)
The loss is the key ingredient of an NN.
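A short sketch (hypothetical values, my own example) of the two losses mentioned above, to make the note on cross-entropy concrete:

```python
import numpy as np

def rmse(pred, target):
    """Root mean squared error, used for regression / reconstruction."""
    return np.sqrt(np.mean((pred - target) ** 2))

def cross_entropy(probs, labels):
    """Cross-entropy for classification: -(1/N) Σ log p(correct class)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print(rmse(np.array([1.1, 1.9]), np.array([1.0, 2.0])))       # ≈ 0.1
print(cross_entropy(np.array([[0.7, 0.3]]), np.array([0])))   # ≈ 0.357
```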
Reconstruction
We use RMSE.