From @mohebbiQuantifyingContextMixing2023.
The Transformer Model
At each layer, Transformer models mix context information into the representation of each token.
Value Zeroing is a novel approach to quantify the contribution of each context token in determining the final representation of a target token, at each layer of a Transformer.
Value Zeroing is based on the Explaining-by-Removing intuition (Covert et al., 2021) shared by many posthoc interpretability methods, but it takes advantage of a specific feature of Transformers: it zeroes only the value vector of a token when computing its importance, but leaves the key and query vectors (and thus the pattern of information flow) intact.
This makes it possible to zero out a token's contribution without removing it from the sentence and thereby changing the sentence's semantics.
So, inside the Self-Attention layer of a Transformer, for each attention head $h$, the input vector $x_i$ of the $i$-th token in the sequence is transformed into three distinct vectors through three different sets of weights:
- The Query vector $q_i^h = x_i W_q^h$;
- The Key vector $k_i^h = x_i W_k^h$;
- The Value vector $v_i^h = x_i W_v^h$.
Then, the context vector $z_i^h$ for the same token is generated as a weighted sum over the Value vectors of all tokens in the sequence:
$$ z_i^h = \sum_{j=1}^{n} \alpha_{i,j}^h \, v_j^h $$
where $\alpha_{i,j}^h$ is the raw attention weight assigned to the $j$-th token, computed as a Softmax-normalized scaled dot product between the Query vector of token $i$ and the Key vector of token $j$:
$$ \alpha_{i,j}^h = \operatorname{softmax}_j\!\left(\frac{q_i^h \cdot k_j^h}{\sqrt{d_k}}\right) $$
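As a concrete illustration, here is a minimal NumPy sketch of a single attention head; the array shapes and names (`X`, `W_q`, `W_k`, `W_v`) are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One self-attention head.
    X: (n, d_model) token vectors; W_q/W_k/W_v: (d_model, d_k) projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # query, key, value vectors
    alpha = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (n, n) attention weights
    return alpha @ V, alpha                          # weighted sum of value vectors
```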
Then, the token's overall context vector is computed by concatenating all the heads' outputs, projecting through an output weight matrix $W_O$, and applying a residual connection followed by layer normalization:
$$ \tilde{z}_i = \mathrm{LN}\!\left(\operatorname{Concat}\!\left(z_i^1, \dots, z_i^H\right) W_O + x_i\right) $$
Finally, the context vector $\tilde{z}_i$ goes through an MLP that first projects the input to a space that is traditionally four times larger, applies a ReLU, and projects it back to the model's hidden size. This is applied to every $\tilde{z}_i$, producing the representation $\tilde{x}_i$, which is again passed through a residual connection and Layer Normalized:
$$ \tilde{x}_i = \mathrm{LN}\!\left(\operatorname{MLP}(\tilde{z}_i) + \tilde{z}_i\right) $$
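Continuing the sketch, a post-norm encoder layer (BERT-style) combines the heads and applies the MLP block; the weight names and the simplified layer norm (no learned scale or bias) are assumptions for illustration:

```python
def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def transformer_layer(X, heads, W_o, W_1, b_1, W_2, b_2):
    """heads: list of (W_q, W_k, W_v) tuples, one per attention head."""
    # Multi-head attention: concatenate head outputs and project with W_o.
    Z = np.concatenate([attention_head(X, *h)[0] for h in heads], axis=-1) @ W_o
    Z_tilde = layer_norm(Z + X)                    # residual + layer norm
    # Position-wise MLP: expand (typically 4x), ReLU, project back.
    H = np.maximum(0.0, Z_tilde @ W_1 + b_1)
    return layer_norm(H @ W_2 + b_2 + Z_tilde)     # residual + layer norm
```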
Value Zeroing
In Value Zeroing, the weighted sum over the Value vectors (the first equation above) is changed by replacing the Value vector $v_j^h$ associated with token $j$ with a zero vector when computing the context vector of token $i$.
This provides a new representation $\tilde{x}_i^{\neg j}$ that has excluded token $j$. Comparing it with the original representation $\tilde{x}_i$ using a pairwise distance metric (such as the cosine distance), we obtain a measure of how much the representation changed due to the exclusion of the token:
$$ C_{i,j} = \operatorname{cos\_dist}\!\left(\tilde{x}_i^{\neg j},\, \tilde{x}_i\right) $$
Computing this equation for each pair of tokens in the sequence, we obtain a Value Zeroing matrix $C$ in which the value of each cell $C_{i,j}$ indicates the degree to which token $i$ depends on token $j$ to form its contextualized vectorial representation.
In particular, we obtain a matrix for each Self-Attention layer in the network.
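A minimal sketch of the per-layer Value Zeroing matrix, reusing the toy layer above: for each token $j$, the layer is recomputed with $v_j^h$ zeroed in every head (queries, keys, and attention weights untouched) and the cosine distance to the original outputs is measured. The helper names are hypothetical, not the authors' implementation:

```python
def cosine_distance(a, b, eps=1e-12):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def attention_head_zeroed(X, W_q, W_k, W_v, j):
    """Attention head in which the value vector of token j is set to zero;
    the attention pattern itself is left intact."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    V = V.copy(); V[j] = 0.0
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def value_zeroing_matrix(X, heads, W_o, W_1, b_1, W_2, b_2):
    n = X.shape[0]
    X_orig = transformer_layer(X, heads, W_o, W_1, b_1, W_2, b_2)  # original outputs
    C = np.zeros((n, n))
    for j in range(n):
        # Recompute the whole layer with v_j zeroed in every head.
        Z = np.concatenate([attention_head_zeroed(X, *h, j=j) for h in heads], axis=-1) @ W_o
        Z_tilde = layer_norm(Z + X)
        H = np.maximum(0.0, Z_tilde @ W_1 + b_1)
        X_noj = layer_norm(H @ W_2 + b_2 + Z_tilde)
        for i in range(n):
            C[i, j] = cosine_distance(X_noj[i], X_orig[i])
    return C
```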
To obtain a single matrix that represents the co-dependency of token representations across the whole network, a technique called Rollout is used.
Rollout
From @abnarQuantifyingAttentionFlow2020
Rollout is an attention-matrix aggregation technique in which the per-layer matrices are multiplied in order, one after the other, obtaining a single final matrix that summarizes how each token's final representation depends on the input tokens.
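A sketch of rollout over the per-layer matrices, ordered from the first layer to the last; the residual-connection correction (averaging each matrix with the identity before multiplying) and the row re-normalization follow the original raw-attention rollout formulation and are included here as an assumption:

```python
def rollout(layer_matrices, add_residual=True):
    """layer_matrices: list of (n, n) per-layer mixing matrices, first layer first."""
    n = layer_matrices[0].shape[0]
    joint = np.eye(n)
    for A in layer_matrices:
        if add_residual:
            A = 0.5 * A + 0.5 * np.eye(n)       # account for residual connections
        A = A / A.sum(axis=-1, keepdims=True)   # keep rows normalized
        joint = A @ joint                       # compose with the layers below
    return joint
```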