From @mohebbiQuantifyingContextMixing2023.
The Transformer Model
At each layer, Transformer models mix context information into the representation of each token.
Value Zeroing is a novel approach to quantify the contribution of each context token in determining the final representation of a target token, at each layer of a Transformer.
Value Zeroing is based on the Explaining-by-Removing intuition (Covert et al., 2021) shared by many posthoc interpretability methods, but it takes advantage of a specific feature of Transformers: it zeroes only the value vector of a token when computing its importance, but leaves the key and query vectors (and thus the pattern of information flow) intact.
This makes it possible to zero out a token's contribution without removing it from the sentence and thereby changing the sentence's semantics.
So, inside the Self-Attention layer of a Transformer, for each attention head $h$, the input vector $x_i$ of the $i$-th token in the sequence is transformed into three distinct vectors through three different sets of weights:
- The Query vector $q_i^h = x_i W_q^h$;
- The Key vector $k_i^h = x_i W_k^h$;
- The Value vector $v_i^h = x_i W_v^h$.
Then, the context vector $z_i^h$ for the same token is generated as a weighted sum over the Value vectors of all tokens in the sequence:
$$ z_i^h = \sum_{j=1}^{n} \alpha_{i,j}^h \, v_j^h $$
where $\alpha_{i,j}^h$ is the raw attention weight assigned to the $j$-th token, computed as a Softmax-normalized scaled dot product between the Query vector of token $i$ and the Key vector of token $j$:
$$ \alpha_{i,j}^h = \operatorname{softmax}_j\!\left(\frac{q_i^h \cdot k_j^h}{\sqrt{d_k}}\right) $$
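As a concrete illustration, here is a minimal NumPy sketch of a single attention head; the array shapes and names (`X`, `W_q`, `W_k`, `W_v`) are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One self-attention head.
    X: (n, d_model) token vectors; W_q/W_k/W_v: (d_model, d_k) projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # query, key, value vectors
    alpha = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (n, n) attention weights
    return alpha @ V, alpha                          # weighted sum of value vectors
```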
Then, the token's overall context vector is computed by concatenating all the heads' outputs, projecting through an output weight matrix $W_O$, and applying a residual connection followed by layer normalization:
$$ \tilde{z}_i = \mathrm{LN}\!\left(\operatorname{Concat}\!\left(z_i^1, \dots, z_i^H\right) W_O + x_i\right) $$
Finally, the context vector $\tilde{z}_i$ goes through an MLP that first projects the input to a space that is traditionally four times larger, applies a ReLU, and projects it back to the model's hidden size. This is applied to every $\tilde{z}_i$, producing the representation $\tilde{x}_i$, which is again passed through a residual connection and Layer Normalized:
$$ \tilde{x}_i = \mathrm{LN}\!\left(\operatorname{MLP}(\tilde{z}_i) + \tilde{z}_i\right) $$
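Continuing the sketch, a post-norm encoder layer (BERT-style) combines the heads and applies the MLP block; the weight names and the simplified layer norm (no learned scale or bias) are assumptions for illustration:

```python
def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def transformer_layer(X, heads, W_o, W_1, b_1, W_2, b_2):
    """heads: list of (W_q, W_k, W_v) tuples, one per attention head."""
    # Multi-head attention: concatenate head outputs and project with W_o.
    Z = np.concatenate([attention_head(X, *h)[0] for h in heads], axis=-1) @ W_o
    Z_tilde = layer_norm(Z + X)                    # residual + layer norm
    # Position-wise MLP: expand (typically 4x), ReLU, project back.
    H = np.maximum(0.0, Z_tilde @ W_1 + b_1)
    return layer_norm(H @ W_2 + b_2 + Z_tilde)     # residual + layer norm
```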
Value Zeroing
In Value Zeroing, the weighted sum over the Value vectors (the first equation above) is changed by replacing the Value vector $v_j^h$ associated with token $j$ with a zero vector when computing the context vector of token $i$.
This provides a new representation $\tilde{x}_i^{\neg j}$ that has excluded token $j$. Comparing it with the original representation $\tilde{x}_i$ using a pairwise distance metric (such as the cosine distance), we obtain a measure of how much the representation changed due to the exclusion of the token:
$$ C_{i,j} = \operatorname{cos\_dist}\!\left(\tilde{x}_i^{\neg j},\, \tilde{x}_i\right) $$
Computing this equation for each pair of tokens in the sequence, we obtain a Value Zeroing matrix $C$ in which the value of each cell $C_{i,j}$ indicates the degree to which token $i$ depends on token $j$ to form its contextualized vectorial representation.
In particular, we obtain a matrix for each Self-Attention layer in the network.
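A minimal sketch of the per-layer Value Zeroing matrix, reusing the toy layer above: for each token $j$, the layer is recomputed with $v_j^h$ zeroed in every head (queries, keys, and attention weights untouched) and the cosine distance to the original outputs is measured. The helper names are hypothetical, not the authors' implementation:

```python
def cosine_distance(a, b, eps=1e-12):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def attention_head_zeroed(X, W_q, W_k, W_v, j):
    """Attention head in which the value vector of token j is set to zero;
    the attention pattern itself is left intact."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    V = V.copy(); V[j] = 0.0
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def value_zeroing_matrix(X, heads, W_o, W_1, b_1, W_2, b_2):
    n = X.shape[0]
    X_orig = transformer_layer(X, heads, W_o, W_1, b_1, W_2, b_2)  # original outputs
    C = np.zeros((n, n))
    for j in range(n):
        # Recompute the whole layer with v_j zeroed in every head.
        Z = np.concatenate([attention_head_zeroed(X, *h, j=j) for h in heads], axis=-1) @ W_o
        Z_tilde = layer_norm(Z + X)
        H = np.maximum(0.0, Z_tilde @ W_1 + b_1)
        X_noj = layer_norm(H @ W_2 + b_2 + Z_tilde)
        for i in range(n):
            C[i, j] = cosine_distance(X_noj[i], X_orig[i])
    return C
```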
To obtain a single matrix that represents the co-dependency of token representations across the whole network, a technique called Rollout is used.
Rollout
From @abnarQuantifyingAttentionFlow2020
Rollout is an attention-matrix aggregation technique in which the per-layer matrices are multiplied in order, one after the other, obtaining a single final matrix that summarizes how each token's final representation depends on the input tokens.
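A sketch of rollout over the per-layer matrices, ordered from the first layer to the last; the residual-connection correction (averaging each matrix with the identity before multiplying) and the row re-normalization follow the original raw-attention rollout formulation and are included here as an assumption:

```python
def rollout(layer_matrices, add_residual=True):
    """layer_matrices: list of (n, n) per-layer mixing matrices, first layer first."""
    n = layer_matrices[0].shape[0]
    joint = np.eye(n)
    for A in layer_matrices:
        if add_residual:
            A = 0.5 * A + 0.5 * np.eye(n)       # account for residual connections
        A = A / A.sum(axis=-1, keepdims=True)   # keep rows normalized
        joint = A @ joint                       # compose with the layers below
    return joint
```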