Traditionally, to evaluate generalization we split the data into training/validation/test sets, where the unseen test data is used to check whether the model generalizes.
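A minimal sketch of that classic setup (scikit-learn on a toy dataset, purely for illustration):

```python
# Purely illustrative: hold out unseen data to estimate generalization.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 60% train, 20% validation, 20% held-out test.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))  # generalization estimate
```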
Natural Language Inference
The model is given two sentences and has to predict the relationship between them.
HANS
A dataset for NLI: given a premise and a hypothesis, the model has to label whether the premise entails the hypothesis.
Models that scored very well on standard NLI data performed poorly on HANS: rather than solving the task, they were latching onto superficial patterns, such as lexical overlap between the two sentences, and always predicting entailment.
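A minimal sketch of the kind of shallow pattern HANS is built to expose (a hypothetical heuristic baseline, not an actual model): predict entailment whenever every hypothesis word also appears in the premise.

```python
# Hypothetical illustration of the lexical-overlap heuristic HANS targets:
# predict "entailment" whenever all hypothesis words also occur in the premise.
def overlap_heuristic(premise: str, hypothesis: str) -> str:
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return "entailment" if hypothesis_words <= premise_words else "non-entailment"

# Full lexical overlap, but the premise does NOT entail the hypothesis:
print(overlap_heuristic("the doctor paid the actor", "the actor paid the doctor"))
# -> "entailment" (wrong): a model relying on this pattern fails on HANS.
```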
So good scores in-distribution don't guarantee good scores out-of-distribution. Then what?
This is the classic picture of i.i.d. vs. o.o.d. generalization, but with LLMs we don't have train-test splits, fixed training data, or uncontaminated training data (evaluation data can end up in pre-training).
How to evaluate generalization in LLMs?
With no training data
The most common scenario: we don't have access to the training data, so we can only estimate or "assume" statistics of the training corpus.
Can Transformers process recursive nested constructions?
It's not robust against contamination issues.
Using synthetic data.
Not robust against contamination either, and it's difficult to know what will actually not appear in the training corpus.
GSM-Plus dataset
An adversarial dataset built by perturbing the original problems (changing the numbers, adding useless distractor sentences, and rephrasing the questions), so the original version can be compared against the perturbed one.
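A hypothetical illustration of the three perturbation types on a toy word problem (not actual GSM-Plus items):

```python
# Hypothetical GSM-Plus-style perturbations of a toy word problem.
original = "Anna has 3 apples and buys 4 more. How many apples does she have?"

perturbations = {
    # 1. Change the numbers (same reasoning, different answer).
    "numerical": "Anna has 7 apples and buys 9 more. How many apples does she have?",
    # 2. Add an irrelevant distractor sentence.
    "distractor": "Anna has 3 apples and buys 4 more. Her brother owns 5 bikes. "
                  "How many apples does she have?",
    # 3. Rephrase the question without changing its meaning.
    "rephrased": "After buying 4 more apples on top of the 3 she already had, "
                 "how many apples does Anna have?",
}

# Evaluation then compares accuracy on the original vs. each perturbed variant.
for name, text in perturbations.items():
    print(f"{name}: {text}")
```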
Also not robust to contamination.
A holistic evaluation of consistency and interaction in prompt-based learning.
LMentry: accuracy on the homophones task measured with different prompts.
There is some overfitting to specific prompts, since accuracy varies a lot across prompts.
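A rough sketch of this kind of prompt-sensitivity check; the templates, examples, and ask_model stub are all hypothetical:

```python
# Hypothetical sketch: measure how accuracy varies across prompt templates
# for the same underlying task (here, a toy homophone question).
from statistics import mean, pstdev

templates = [
    "Are '{a}' and '{b}' homophones? Answer yes or no.",
    "Do '{a}' and '{b}' sound the same? Answer yes or no.",
    "Question: are the words '{a}' and '{b}' pronounced identically? Answer yes or no.",
]
examples = [("pair", "pear", "yes"), ("cat", "dog", "no")]

def ask_model(prompt: str) -> str:
    # Stand-in for a real LLM call, just so the script runs.
    return "yes"

per_prompt_accuracy = []
for template in templates:
    correct = sum(
        ask_model(template.format(a=a, b=b)).strip().lower() == gold
        for a, b, gold in examples
    )
    per_prompt_accuracy.append(correct / len(examples))

# A large spread across templates suggests overfitting to specific prompt wordings.
print("mean accuracy:", mean(per_prompt_accuracy), "std:", pstdev(per_prompt_accuracy))
```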
For the multilingual setting, multiple studies have been conducted: rather than a shared representation of semantics and knowledge, languages that are more similar in form simply have closer word embeddings, so knowledge is passed across languages in a shallow way.
Consistency across "representations": knowledge is tied to surface forms (next-word prediction), and we don't have a deeper representation.
They compare consistency on question pairs, one original and one generated (a paraphrase or a translation), and check whether the answers are consistent.
As a baseline, they ask the model the same question twice.
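A rough sketch of such a consistency metric, with a toy stand-in for the model (all names hypothetical):

```python
# Hypothetical sketch of the consistency check: does the model give the same
# answer to an original question and to its paraphrase/translation?
def consistency(model, question_pairs):
    """Fraction of pairs where the model's two answers agree."""
    agree = sum(model(q1).strip().lower() == model(q2).strip().lower()
                for q1, q2 in question_pairs)
    return agree / len(question_pairs)

def toy_model(question: str) -> str:
    # Stand-in for a real LLM call, just so the example runs.
    return "Sergio Mattarella" if "Italy" in question else "unknown"

pairs = [("Who is the president of Italy?", "Who currently holds the Italian presidency?")]
baseline_pairs = [(q, q) for q, _ in pairs]  # ask the exact same question twice

print("paraphrase consistency:", consistency(toy_model, pairs))
print("same-question baseline:", consistency(toy_model, baseline_pairs))
```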
The models are not consistent.
If we ask a question in the language the question relates to (e.g. asking who the president of Italy is, in Italian), the accuracy is much higher, so there is not really a deeper representation of meaning; it is strongly tied to form.
When a model gives an incorrect answer, that answer is unlikely to be spread across sources in different languages; if you ask who the president of Germany is, that information is on Wikipedia in every language, for example.
The consistency of correct answers is higher than the consistency of incorrect answers, which is probably due in part not to generalization but to the model having picked up the fact separately from each language.
These tests only provide negative evidence: obtaining good scores is not positive evidence that the model is actually generalizing.
If we have training data:
Search specifically for things that are not represented in the training data. But this is hard ("Impact of Pretraining Term Frequencies on Few-Shot Reasoning").
There is a correlation between the frequency of a term in the pre-training corpus and the average accuracy, so there is no real generalization.
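A minimal sketch of that kind of frequency-vs-accuracy analysis (the terms, frequencies, and accuracies are made up):

```python
# Hypothetical sketch: correlate a term's pre-training frequency with the model's
# average accuracy on questions mentioning that term (all numbers are made up).
from scipy.stats import spearmanr

term_frequency = {"Paris": 2_500_000, "Tbilisi": 80_000, "Funafuti": 3_000}
avg_accuracy = {"Paris": 0.95, "Tbilisi": 0.62, "Funafuti": 0.18}

terms = sorted(term_frequency)
rho, p_value = spearmanr([term_frequency[t] for t in terms],
                         [avg_accuracy[t] for t in terms])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A strong positive correlation suggests recall of frequent facts
# rather than genuine generalization.
```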
They checked for overlapping substrings using n-gram overlap, considering an example contaminated if the same n-gram appears in both the training and the evaluation data, and then estimated the performance gain from having contaminated data.
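A rough sketch of such an n-gram contamination check (toy corpus, illustrative n):

```python
# Hypothetical n-gram contamination check: flag an evaluation example as
# contaminated if it shares at least one n-gram with the training corpus.
def ngrams(text: str, n: int) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(example: str, training_ngrams: set, n: int = 8) -> bool:
    return not ngrams(example, n).isdisjoint(training_ngrams)

training_corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
training_ngrams = set().union(*(ngrams(doc, 8) for doc in training_corpus))

eval_examples = [
    "the quick brown fox jumps over the lazy dog again today",  # shares an 8-gram
    "a slow green turtle walks under the busy bridge",           # no overlap
]
for example in eval_examples:
    print(is_contaminated(example, training_ngrams), "-", example)
# The performance gain is then estimated by comparing accuracy on the
# contaminated vs. clean subsets of the evaluation data.
```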
The n-gram threshold needs to be set very low: even a little contamination helps the model a lot.
But if it's that little, is it really contamination, or has the model learned to effectively apply that knowledge?