The time of datasets has passed.

Pre-Dataset Era

Around the turn of the century things started to change: we went from (Computer Vision) papers that had only images in the results to papers that actually report measures of performance, turning Machine Learning into a quantitative science.

The Dataset Era

*When you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of the meager and unsatisfactory kind.* - Lord Kelvin

Dataset-Driven Breakthroughs:

  • SIFT
  • DPMs
  • ConvNets
But at first, datasets were mainly for evaluation; they were benchmarks.

Training vs Testing Dataset

Even with a held-out test set, a model trained on one dataset's training split typically performs poorly on other datasets.

Goodhart’s law:
When a measure becomes a target, it ceases to be a good measure.

In Academic ML: the dataset is the world.
In practice: the dataset is a representation of the world.

And sometimes it can be a bad representation, because all data is biased!
Images come from the internet, etc., and that source carries its own biases.

Every time you capture images, you put bias into the capture. You can't un-bias it.

Sampling bias: most photos are taken in the USA, Central Europe, and Japan.
Photographer bias: people want their pictures to be recognizable and beautiful.
Social bias: there are traditions and conventions for how pictures are usually taken.

Dataset bias is real and is not going away: any dataset is just a small part of the whole distribution. E.g., ImageNet over-samples sports cars.

A model trained on ImageNet and tested on ImageNet-v2, a dataset created to be as close as possible to the original by replicating the original collection procedure and matching its biases as closely as possible, still drops drastically in performance.

Classifiers love to cheat: they latch onto some statistical texture of the image, and however you perturb the image, as long as that texture survives, the model's prediction doesn't change.

For video, models cheat even more.

Dataset bias will not go away: all datasets are finite, so there will always be ways to cheat. Learning becomes pair memorization.
We are raising a generation of algorithms who can only “cram for the test” (set).

Is memorization that bad?

For some cases, maybe it's not. If humans had a lookup table instead of a brain, it would eventually get better and better: kNN converges to the Bayes risk (Cover & Hart, 1967), which is the best you can do. On a static planet this would be the perfect strategy, but our planet changes constantly.
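As a toy illustration (not from the talk; the data and names here are invented), a lookup-table learner is just nearest-neighbour classification: store everything, answer with the closest memorized example.

```python
# Toy sketch of "learning as a lookup table": 1-nearest-neighbour
# classification. As the table grows, its error approaches the Bayes risk
# up to a small factor (Cover & Hart, 1967). Data is purely illustrative.

def nn_predict(table, x):
    """Return the label of the stored example closest to x."""
    nearest = min(table, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# "Memorized" training pairs: (feature, label)
table = [(0.1, "cat"), (0.2, "cat"), (0.9, "dog"), (1.1, "dog")]

print(nn_predict(table, 0.15))  # near the "cat" cluster
print(nn_predict(table, 1.0))   # near the "dog" cluster
```

On a static distribution, adding more rows to `table` is all the "learning" this agent ever needs; the problem, as the notes say, is that the world is not static.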

Our brain is a generalist.

What can we do?

  • Get rid of tests (Montessori school): no objective → no way to cheat. Unsupervised learning.
  • Relentless daily homework instead of final exams → continual, online learning. Instead of human-labeled data, self-supervision.

Why do we have vision?

  • To see what is where, by looking - Aristotle, Marr, etc.
  • …
  • To predict the world - Jakob von Uexküll, Jan Koenderink, Moshe Bar, etc.
  • …
  • To make babies who make babies - Darwin, Dawkins

The world as supervision: trying to predict some aspect of the world that we interact with / have an effect on.
Self-supervision is turning everything into a prediction problem.

Sentience - J. Koenderink

Sensory-action worlds

Each organism has its own umwelt, or “surrounding world”: the organism's sensory and action world. It is determined by biology and “bounds the universe from the perspective of the animal”.
It is naturally going to differ between organisms, and none of them is going to be “correct”.

The tick has only two sensors, temperature and smell; for it, the world is this two-dimensional world. The tick's umwelt is a boring one, yet the tick is very successful at what it does.

Umwelts co-evolve: for example, the spider's umwelt includes the tension of the strands of its web, which is invisible to the fly because the web is not in the fly's own umwelt; the spider eats only because of this.

The diagram is basically a state-action model: you get input from the world, you take an action, you receive feedback, and you predict how your action changes the state of the world.

The better your world model is, the less the real world matters: you need the real world only when you get a prediction wrong. If you are always correct, the real world doesn't matter at all.
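The predict-act-observe loop can be sketched as an agent that consults the real world only when its prediction fails; once the world model is correct, nothing surprising ever arrives. This is a toy sketch (the world, model, and names are invented for illustration):

```python
# Toy predict-observe loop: the agent predicts the next state with its world
# model and updates the model only on prediction errors ("surprises").

def run(model, dynamics, state, steps):
    """Step through the world; count and repair prediction failures."""
    surprises = 0
    for _ in range(steps):
        predicted = model.get(state)     # what the world model expects
        nxt = dynamics[state]            # what the real world actually does
        if predicted != nxt:             # prediction error -> learn from it
            surprises += 1
            model[state] = nxt
        state = nxt
    return surprises

dynamics = {"a": "b", "b": "c", "c": "a"}   # a simple cyclic world
model = {}                                  # start knowing nothing
first = run(model, dynamics, "a", 6)        # surprised while learning
second = run(model, dynamics, "a", 6)       # model now perfect: no surprises
print(first, second)
```

Once `model` matches `dynamics`, the second pass produces zero surprises: exactly the sense in which a perfect world model makes the real world redundant.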

Why no fixed dataset?

Real-world motivation:

  • biological agents never see the same data twice.
  • every new piece of data is first a “test” sample, and then a “train” sample.

Repeating the same sample might encourage memorization and discourage generalization:

  • for self-supervised learning, no excuse to use multiple epochs.

But what about consistent evaluation?

  • Evolving test sets are good enough for healthcare, polling, and the stock market; error bars are our friends. If they can do it, so can we.
You basically train online: every new piece of data is first used for validation, then for training.
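This "first test, then train" protocol is known as prequential (progressive validation) evaluation. A minimal sketch with a running-mean predictor (the predictor and data are illustrative, not from the talk):

```python
# Prequential ("test-then-train") evaluation: each incoming sample is first
# used to measure error, and only then used for training.

def prequential(stream):
    """Return per-sample squared errors; always evaluate before updating."""
    n, mean = 0, 0.0
    errors = []
    for y in stream:
        errors.append((y - mean) ** 2)   # test on the new sample first...
        n += 1
        mean += (y - mean) / n           # ...then train on it
    return errors

errs = prequential([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(errs)
```

Note the error spikes when the stream shifts from 1.0 to 5.0 and then decays again: the evolving "test set" tracks the distribution as it changes.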

What about catastrophic forgetting?

Case 1: you don't want to forget old data:

  • then your distribution is probably stationary and your sampling is not i.i.d.; use a bigger buffer to make things (closer to) i.i.d.

Case 2: the domain shifts:

  • if your distribution is actually shifting, then you probably need to forget.
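The "bigger buffer" idea in Case 1 is essentially experience replay: store the temporally correlated stream in a buffer and sample training batches uniformly from it, which makes the batches approximately i.i.d. A minimal sketch (capacity and sizes are illustrative):

```python
import random

class ReplayBuffer:
    """Fixed-capacity buffer; uniform sampling breaks temporal correlation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.pos = 0

    def add(self, x):
        if len(self.items) < self.capacity:
            self.items.append(x)
        else:                                   # full: overwrite the oldest
            self.items[self.pos] = x
            self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.items, batch_size)  # uniform, order-free

# A very correlated stream: 125 zeros followed by 125 ones.
buf = ReplayBuffer(capacity=500)
for t in range(250):
    buf.add(t // 125)

batch = buf.sample(8)   # a training batch now mixes early and late samples
print(batch)
```

A bigger capacity pushes batches toward i.i.d. over the whole history (Case 1); a small sliding window instead forgets old data, which is what you want under a genuine shift (Case 2).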

Domain shifts cannot be abrupt.
But we are not universal adaptors (observational bias); see Dubey et al., ICML'18 for impossible games. We need another constraint: smoothness. Just as in evolution, we need smooth change, not abrupt change.

On Infinite Data Streams

It’s good to train on test data because there is no difference.
They attempt to operationalize online learning on an infinite smooth dataset is inspired by one sample learning.
They took a single image, they half size it, and then they train a small network to upsample the original and then the image and then they apply to itself. Even if has seen less data, it saw the only data it needed. It overfits to the only thing it has to do.
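The recipe above is in the spirit of "Zero-Shot" Super-Resolution (Shocher et al.). A heavily simplified sketch, using a 1-D signal and a linear 3-tap model instead of an image and a network (both simplifications are mine, not the paper's):

```python
import numpy as np

# One-sample upsampling sketch: downscale the ONE signal we have, learn to
# map the half-size version back to the original, then apply that learned
# upsampler to the signal itself to double its resolution.

rng = np.random.default_rng(0)
signal = np.cumsum(rng.normal(size=64))      # the single "image" we own

def downscale(x):
    return x.reshape(-1, 2).mean(axis=1)     # average neighbouring pairs

def feats(x):
    # each sample together with its two neighbours (edge-padded)
    p = np.pad(x, 1, mode="edge")
    return np.stack([p[:-2], p[1:-1], p[2:]], axis=1)

# Train: from each half-res sample, predict the 2 full-res samples under it.
lo = downscale(signal)
W, *_ = np.linalg.lstsq(feats(lo), signal.reshape(-1, 2), rcond=None)

# Apply the learned upsampler to the full signal -> double the resolution.
upsampled = (feats(signal) @ W).ravel()
print(upsampled.shape)
```

The model is trained on exactly one (downscaled, original) pair and then reused on the original itself: it overfits to the only job it has.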

JIT-Net: it's ok to keep overfitting.

So the idea is to first train (a bit), then test.
The model adapts to the future once it arrives.

Test-Time Training

Standard test error:

For a test distribution that may differ from the training distribution:

  • the test sample itself gives us a hint about that distribution;
  • no fixed model; adapt at test time
  • a one-sample learning problem
  • no label? Self-supervision!

Algorithm for TTT

You train on the single test sample by rotating and perturbing it and predicting which transformation was applied.
You fine-tune the model on this self-supervised task, and then make the prediction on the actual sample.
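A minimal sketch of the self-supervised adaptation step: given one unlabelled test image, build its 4 rotated copies, take a few gradient steps on "which rotation is this?", and keep the adapted weights. Only the rotation head is shown here; the shared encoder and main-task head of the actual method are omitted, and all sizes are illustrative.

```python
import numpy as np

# Test-time adaptation on ONE sample via rotation prediction (sketch).

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def xent(P):
    # cross-entropy of the 4 rotated copies against their labels 0..3
    return -np.log(P[np.arange(4), np.arange(4)]).mean()

def ttt_adapt(W, img, lr=0.01, steps=50):
    """Fine-tune weights W on rotation prediction for a single image."""
    X = np.stack([np.rot90(img, k).ravel() for k in range(4)])  # 4 x D
    Y = np.eye(4)                                               # labels 0..3
    for _ in range(steps):
        P = softmax(X @ W)
        W = W - lr * X.T @ (P - Y) / 4     # cross-entropy gradient step
    return W, X

img = rng.normal(size=(8, 8))              # the single test "image"
W0 = 0.01 * rng.normal(size=(64, 4))       # pre-adaptation weights
W, X = ttt_adapt(W0, img)

loss_before = xent(softmax(X @ W0))
loss_after = xent(softmax(X @ W))
print(loss_before, loss_after)             # adaptation lowers the SSL loss
```

The bet, as the notes say below, is that lowering this self-supervised loss on the test sample also improves the main-task prediction on that same sample.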

It seems there is no saturation and no overfitting; the prediction quality doesn't go down.

We assume there is some sort of correlation between the main task and the self-supervised task.

They go back to the base model after every sample. You would rather not reset: you want no reset, you want online learning on a smoothly changing data stream.