Interest in understanding the organization of the embedding space (cf. word vector arithmetic).

LLM embedding spaces do not seem to be organized like static embeddings; their structure looks more like an artifact of training than of semantics. Tokens are arranged more by position and frequency, with more common tokens clustered together.

We do not get semantically meaningful directions; the embeddings occupy a “narrow cone”. The variance in the embedding space is not uniformly distributed (it is not isotropic): some dimensions matter much more than others.

This may be harmful for LLMs.

Transformer-based models have a geometry that is considered bad, yet they still work better.

IsoScore: a measure of how uniformly the embedding space is utilized.
Isotropy: a distribution is isotropic if the variance of the data is uniformly distributed across dimensions (i.e. the covariance matrix is proportional to the identity matrix).

The conventional proxy is cosine similarity: if the cosine similarity between points is very high, the angle between them is very narrow. But that is not a good way to measure isotropy.
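For reference, here is a minimal sketch of that conventional cosine-similarity proxy (the function name and the example data are illustrative, not from the talk):

```python
import numpy as np

# Average pairwise cosine similarity of a point cloud of embeddings.
# A high average cosine similarity is usually read as "narrow cone"/anisotropic.
def avg_cosine_similarity(X: np.ndarray) -> float:
    """X: (n_points, dim) matrix of embeddings."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    sims = X @ X.T                                     # all pairwise cosines
    n = X.shape[0]
    off_diag = sims[~np.eye(n, dtype=bool)]            # drop self-similarities
    return float(off_diag.mean())

# Example: an isotropic Gaussian cloud has near-zero average cosine similarity.
rng = np.random.default_rng(0)
print(avg_cosine_similarity(rng.normal(size=(500, 64))))
```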

  1. Take the point cloud of sentence embeddings;
  2. Project it using PCA;
  3. Compute the covariance of the PCA-projected points;
  4. Normalize the diagonal of the covariance to have the same norm as the all-ones vector (1, …, 1);
  5. Calculate the distance between this normalized diagonal and (1, …, 1), relative to the maximum possible distance;
  6. Rescale the value obtained to the interval [0, 1], obtaining a measure of isotropy (see the sketch after this list).
    The score is robust to the mean and to the maximum variance, accounts for the number of dimensions and for changes to the covariance, is rotation invariant, and is globally stable.
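Below is a minimal numpy sketch of these steps. The exact normalization constants (the maximum-distance denominator and the final rescaling) follow my reading of the IsoScore formulation and may differ in detail from the original.

```python
import numpy as np

def isoscore(X: np.ndarray) -> float:
    """X: (n_points, dim) point cloud of embeddings."""
    n = X.shape[1]
    # 1.-2. Center the point cloud and reorient it along its principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    X_pca = Xc @ Vt.T
    # 3. Diagonal of the covariance in the PCA basis (per-dimension variances).
    var = X_pca.var(axis=0)
    # 4. Normalize the diagonal to have the same norm as the all-ones vector.
    var_hat = np.sqrt(n) * var / np.linalg.norm(var)
    # 5. Distance to the all-ones vector, divided by the maximum possible distance.
    delta = np.linalg.norm(var_hat - np.ones(n)) / np.sqrt(2 * (n - np.sqrt(n)))
    # 6. Map the defect to [0, 1]: 1 = perfectly isotropic, 0 = a single direction used.
    k = n - delta**2 * (n - np.sqrt(n))   # dimensions utilized isotropically
    return float((n * (k**2 / n**2) - 1) / (n - 1))

rng = np.random.default_rng(0)
print(isoscore(rng.normal(size=(2000, 32))))                            # close to 1
print(isoscore(rng.normal(size=(2000, 32)) * np.linspace(1, 10, 32)))   # well below 1
```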

IsoScore measures isotropy.

Models are less isotropic than previously thought.

Now they are trying to use IsoScore as a regularizer to shape the embedding space during training.
IsoScore does not work well on a small batch of points, which is what is available during training.
They stabilize the mini-batch computation of IsoScore thanks to Regularized Discriminant Analysis (Friedman, 1989).
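A minimal sketch of RDA-style covariance shrinkage for mini-batches follows; the mixing scheme and the parameter name `zeta` are my assumptions, not the exact formulation from the talk.

```python
import numpy as np

# Shrink the noisy mini-batch covariance towards a scaled identity so that the
# isotropy estimate stays stable even when batch_size << dim.
def shrunk_covariance(batch: np.ndarray, zeta: float = 0.5) -> np.ndarray:
    """batch: (batch_size, dim) mini-batch of embeddings; zeta in [0, 1]."""
    Xc = batch - batch.mean(axis=0)
    cov = Xc.T @ Xc / max(batch.shape[0] - 1, 1)   # noisy sample covariance
    dim = cov.shape[0]
    avg_var = np.trace(cov) / dim                  # scale of the identity target
    return (1.0 - zeta) * cov + zeta * avg_var * np.eye(dim)

# With only 8 points in 64 dimensions the raw covariance is rank-deficient;
# the shrunk estimate stays full-rank and well-conditioned.
rng = np.random.default_rng(0)
tiny_batch = rng.normal(size=(8, 64))
print(np.linalg.matrix_rank(shrunk_covariance(tiny_batch, zeta=0.0)))  # <= 7
print(np.linalg.matrix_rank(shrunk_covariance(tiny_batch, zeta=0.5)))  # 64
```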

This method does actually stabilize the isotropy calculation, even with the small sample sizes RDA is applied to.

Lambda controls whether you want to increase or reduce isotropy.
If you increase isotropy, the model gets worse. So isotropy does not seem to be a good thing: encouraging the embeddings towards low isotropy seems to work better (a sketch of the regularized loss follows).
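A minimal sketch of how such a regularizer could be attached to the task loss. The sign convention (positive lambda rewards isotropy, negative lambda discourages it) and the placeholder `isoscore_star` for the stabilized, differentiable mini-batch IsoScore are my assumptions.

```python
import torch

def regularized_loss(task_loss: torch.Tensor,
                     embeddings: torch.Tensor,
                     lam: float,
                     isoscore_star) -> torch.Tensor:
    """embeddings: (batch_size, dim) hidden states from the layer being shaped."""
    # lam > 0 pushes the representations towards isotropy, lam < 0 away from it.
    return task_loss - lam * isoscore_star(embeddings)
```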

Anisotropy seems beneficial: low intrinsic dimensionality in later layers correlates with better model performance.

IsoScore is incompatible with clustering objectives.

Lower IsoScore = better silhouette score (better clustering).


Outlier dimensions are dimensions in LLM representations whose variance is at least 5x larger than the average variance of the global vector space.
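A minimal sketch of detecting outlier dimensions under this definition (the function name and threshold argument are illustrative):

```python
import numpy as np

# Flag dimensions whose variance is at least `ratio` times the average
# per-dimension variance of the representation space.
def outlier_dimensions(hidden_states: np.ndarray, ratio: float = 5.0) -> np.ndarray:
    """hidden_states: (n_tokens, dim) representations from one layer."""
    variances = hidden_states.var(axis=0)
    return np.flatnonzero(variances >= ratio * variances.mean())

# Example: dimension 7 is inflated by hand, so it is flagged as an outlier.
rng = np.random.default_rng(0)
H = rng.normal(size=(2000, 16))
H[:, 7] *= 10.0
print(outlier_dimensions(H))   # -> [7]
```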

The origins of this effect could be:

  • Token imbalance: the input distribution matches the output distribution. However, vision transformers also have outliers;
  • LayerNorm has a bias parameter in later layers of the LLM that amplifies outlier dimensions. However, models with RMSNorm still have outliers;
  • Adam: the dependence on the second moment causes outlier activations.

Fine-tuning increases the variance of outlier dimensions dramatically. It may be that these outlier dimensions are useful for classification tasks (and this seems to be consistent across various fine-tuning classification tasks).

They tested this by classifying with a linear threshold on the largest (highest-variance) dimension alone (a sketch follows). Performance varies across models: in some models this barely hurts classification performance, while in other cases it performs much worse.
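A minimal sketch of such a 1D probe: pick the highest-variance (outlier) dimension and classify with a single threshold on it. The brute-force threshold search and the toy data are my own simplification.

```python
import numpy as np

def fit_1d_threshold(features: np.ndarray, labels: np.ndarray):
    """features: (n_examples, dim); labels: binary {0, 1}."""
    dim = int(np.argmax(features.var(axis=0)))          # the outlier dimension
    values = features[:, dim]
    # Try every observed value as a threshold and keep the most accurate one.
    best = max(
        ((np.mean((values >= t).astype(int) == labels), t) for t in np.unique(values)),
        key=lambda pair: pair[0],
    )
    accuracy, threshold = best
    return dim, threshold, accuracy

# Toy usage: a synthetic dataset where one dimension carries the label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 32))
X[:, 5] += 8.0 * y                                      # inflate dim 5 with the label
print(fit_1d_threshold(X, y))                           # dim 5, high accuracy
```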

There is a high correlation between the variance of that dimension and the classification performance of the 1D probe.


Factual associations are knowledge tuples of the form (subject, relation, object), e.g. (Eiffel Tower, is located in, Paris).

Causal mediation analysis relies on curating minimal pairs. These pairs are used to calculate patching effects and to measure model-editing success (a patching sketch follows below).
CounterFact is built from knowledge tuples. Using it shows that middle-layer MLPs are largely responsible for storing factual associations.
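Below is a hedged sketch of measuring one patching effect on a minimal pair with a Hugging Face GPT-2. The model choice, the layer index, and patching only the last token's MLP output are illustrative simplifications, not the exact protocol from the talk.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

clean = "The Eiffel Tower is located in the city of"
corrupted = "The Colosseum is located in the city of"
target_id = tok(" Paris")["input_ids"][0]
layer = 6  # a middle-layer MLP

def last_token_mlp(prompt, patch=None):
    """Run the model; optionally overwrite the last token's MLP output at `layer`."""
    store = {}
    def hook(module, inputs, output):
        if patch is None:
            store["act"] = output[:, -1, :].detach().clone()
            return output
        output = output.clone()
        output[:, -1, :] = patch                 # patch in the clean activation
        return output
    handle = model.transformer.h[layer].mlp.register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits
    handle.remove()
    prob = torch.softmax(logits[0, -1], dim=-1)[target_id].item()
    return prob, store.get("act")

p_clean, clean_act = last_token_mlp(clean)
p_corrupt, _ = last_token_mlp(corrupted)
p_patched, _ = last_token_mlp(corrupted, patch=clean_act)
print(f"P(Paris): clean={p_clean:.3f} corrupted={p_corrupt:.3f} patched={p_patched:.3f}")
# The patching effect is roughly how much of the clean behaviour is restored,
# e.g. p_patched - p_corrupt.
```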

They are creating a visual version of CounterFact.