We pass from the imprecise base ranker, which gives us a first cut of documents, to a second stage with a much more precision-oriented ranking.
The idea is to give the query + matching docs to a first recall-oriented ranking stage (stage 1), then pass its top documents + the query to a second precision-oriented ranking stage (stage 2) that gives us the final top documents.
This was how the Efficiency/Effectiveness Trade-off was solved before LLMs.
The main research lines in this trade-off are:
- Optimizing efficiency within the learning process. Make the learning process efficiency-aware, so that the model is trained to be fast.
- Approximate score computation and efficient cascades. We want to approximate the relevance score computed by the model, and change the cascade to make it more efficient.
- Efficient traversal of tree-based models. Since the SOTA consists of tree-based models, can we work on the algorithms used to traverse these models, i.e., to run inference on them?
Optimizing Efficiency within the Learning Process
Learning to Efficiently Rank [WLM10]. Wang et al., 2010 propose a new cost function for learning ranking models that directly optimizes an efficiency-effectiveness trade-off metric (EET):
It is a harmonic combination of an effectiveness measure for the query and an efficiency measure of how fast that query is ranked. The efficiency measure maps the query execution time (in ms) to a value between 0 and 1, where 1 means very fast and 0 very slow; a hyperparameter controls how quickly this value decays with time.
MEET then averages EET over all the queries in the training set.
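A sketch of the metric in LaTeX: the harmonic combination matches the description above, while the Gaussian-shaped time decay, the symbol σ and the use of NDCG as the effectiveness E(q) are assumptions of this write-up, not copied from the slides.

$$
\mathrm{EET}(q) = \frac{2\, E(q)\, T(q)}{E(q) + T(q)}, \qquad
T(q) = \exp\!\left(-\frac{t(q)^2}{2\sigma^2}\right), \qquad
\mathrm{MEET} = \frac{1}{|Q|}\sum_{q \in Q} \mathrm{EET}(q)
$$

Here E(q) is the effectiveness of the ranking for query q (e.g., NDCG), t(q) is the query execution time, and σ is the hyperparameter controlling how fast the efficiency score decays.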
This work focuses on linear feature-based ranking functions. The learned functions show a significant decrease in average query execution time.
Cost-Sensitive Tree of Classifiers
The cost of feature extraction is not the same for each feature: some are simpler to compute than others. Xu et al., 2013 observe that the test-time cost of a classifier is dominated by the computation required for feature extraction.
So, they extract different features for each path in the tree (only those needed by that path). This reduces the average test-time complexity. It amounts to input-dependent feature selection with a dynamic allocation of time budgets, where higher budgets go to infrequent paths. A sketch of this lazy, per-path feature extraction is given below.
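A minimal sketch of extracting only the features touched along the traversed path; the node layout, the extractor functions and the caching scheme are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    feature: Optional[str] = None          # feature tested at this node (None for leaves)
    threshold: float = 0.0
    left: Optional["Node"] = None          # taken when the feature value <= threshold
    right: Optional["Node"] = None
    prediction: float = 0.0                # used only at leaves

def classify(root: Node, raw_input, extractors: dict) -> float:
    """Traverse the tree, extracting a feature only the first time a node needs it."""
    cache = {}                             # features already paid for on this input
    node = root
    while node.feature is not None:
        if node.feature not in cache:
            # Pay the extraction cost only for features on the traversed path.
            cache[node.feature] = extractors[node.feature](raw_input)
        node = node.left if cache[node.feature] <= node.threshold else node.right
    return node.prediction
```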
Training Efficient Tree-Based Models for Document Ranking [AL13]
Asadi et al., 2013 propose a technique for training Gradient Boosted Regression Trees that have efficient runtime characteristics: shallow, compact and balanced trees yield faster predictions.
So they use cost-sensitive tree induction: minimize both the loss and the evaluation cost.
They attack it with two strategies:
- They modify the node splitting criterion during tree induction (allow the split with maximum gain only if it does not increase the maximum depth of the tree; otherwise find a node closer to the root which, if split, results in a gain larger than the discounted maximum gain); see the sketch below.
- Pruning while boosting, focused on tree depth and density (terminal nodes are collapsed until the number of internal nodes matches that of a balanced tree; the additional boosting stages compensate for the loss in effectiveness).
The pruning approach is superior, obtaining a 40% decrease in prediction latency with minimal reduction in quality.
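A minimal sketch of the depth-constrained splitting rule mentioned in the first strategy; the gain computation, the discount factor and the candidate bookkeeping are illustrative assumptions.

```python
def pick_split(candidates, max_depth, discount=0.9):
    """candidates: list of (node_depth, gain, split) for all splittable frontier nodes.

    Pick the maximum-gain split if it does not deepen the tree beyond max_depth;
    otherwise fall back to a node closer to the root whose gain beats the
    discounted maximum gain."""
    best_depth, best_gain, best_split = max(candidates, key=lambda c: c[1])
    if best_depth + 1 <= max_depth:
        return best_split
    shallow = [c for c in candidates
               if c[0] + 1 <= max_depth and c[1] >= discount * best_gain]
    if shallow:
        return min(shallow, key=lambda c: c[0])[2]   # the candidate closest to the root
    return None                                      # no split satisfies the depth budget
```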
CLEAVER [LNO+16a]
Lucchese et al., 2016 propose a pruning & re-weighting post-processing methodology. They propose several pruning strategies, applied in post-processing after training:
- random, last. In gradient boosting the first trees contribute a lot, while the last ones fixate on details and contribute less;
- skip, low weights. Skip removes one tree every X; low weights removes the trees whose partial score is very frequently low;
- score loss.
- quality loss. If I remove a tree and compute NDCG again, how much do I lose? If little, remove it.
A greedy line-search strategy is then applied to the tree weights, giving each remaining tree a different weight in the final score.
The combination of the two (with the quality-loss method) yields the same effectiveness as the original model with up to 20% fewer trees.
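A minimal sketch of quality-loss pruning, assuming a helper ndcg(trees, validation_set) and an ensemble stored as a list of trees (both are illustrative assumptions, not the CLEAVER implementation).

```python
def prune_by_quality_loss(trees, validation_set, ndcg, target_fraction=0.8):
    """Greedily drop the tree whose removal hurts validation NDCG the least,
    until only target_fraction of the original trees remain."""
    kept = list(trees)
    target_size = int(len(trees) * target_fraction)
    while len(kept) > target_size:
        # For each candidate, measure the NDCG of the ensemble without it.
        losses = [(ndcg([t for t in kept if t is not c], validation_set), c)
                  for c in kept]
        best_ndcg, victim = max(losses, key=lambda x: x[0])
        kept.remove(victim)          # removing `victim` costs the least quality
    return kept
```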
X-CLEAVER [LNO+18]
Lucchese et al., 2018 apply pruning and re-weighting during gradient boosting:
- Redundant trees are removed from the given ensemble;
- Weights of the remaining trees are fine-tuned by optimizing the desired ranking quality metric, i.e., NDCG;
It uses the same pruning strategies as CLEAVER.
X-CLEAVER allows training even more compact forests with no loss in performance.
DART [VGB15]
Rashmi & Gilad-Bachrach propose to employ dropout from neural networks while learning a MART (Multiple Additive Regression Trees): DART.
The initial trees of the model contribute, on average, more than the final ones.
Dropout is used as a way to fight this over-specialization; shrinkage (the introduction of tree weights via a learning rate) helps but does not solve it.
DART differs from MART:
- When learning a new tree, a random subset of the previously fitted trees is muted (dropped);
- A normalization step is applied when adding the new tree, to avoid overshooting. This is needed because the new tree is fitted with some trees turned off, so its score compensates for the muted trees and would overshoot when the full ensemble is used (a sketch of one DART iteration is given below).
This balances the trees much more, making them almost equally important for the prediction.
They do better than LambdaMART, which was the SOTA using shrinkage.
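A minimal sketch of one DART boosting iteration. The fit_tree helper (fitting a tree to the residuals of the un-muted ensemble) is a stand-in, and the normalization constants assume a unit learning rate; everything here is an illustrative sketch, not the paper's code.

```python
import random

def dart_iteration(trees, weights, fit_tree, drop_rate=0.1):
    """One boosting step: mute a random subset of trees, fit a new tree on the
    residuals of the un-muted ensemble, then normalize to avoid overshooting."""
    dropped = [i for i in range(len(trees)) if random.random() < drop_rate]
    active = [i for i in range(len(trees)) if i not in dropped]

    # Fit the new tree against the ensemble with the dropped trees muted.
    new_tree = fit_tree(active)

    # Normalization: with k dropped trees, scale the new tree by 1/(k+1) and the
    # dropped trees by k/(k+1), so the total contribution does not overshoot.
    k = len(dropped)
    new_weight = 1.0 / (k + 1)
    for i in dropped:
        weights[i] *= k / (k + 1)

    trees.append(new_tree)
    weights.append(new_weight)
    return trees, weights
```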
X-DART [LNO+17]
Lucchese et al. merge DART with pruning while training. Like DART, some trees are muted while fitting the new tree; after fitting, the muted set is evaluated and either pruned or re-introduced.
X-DART builds more compact models than DART; smaller models are less prone to overfitting, with potential for higher effectiveness.
Three strategies for pruning trees (ratio, fixed, adaptive).
They obtain statistically significant improvements w.r.t. DART with up to 20% fewer trees when using the adaptive strategy. If they push it to 40% fewer trees they obtain the same effectiveness.
Approximate Score Computation and Efficient Cascades [CZC+10]
Cambazoglu et al., 2010 introduce additive ensembles with early exits using various kinds of thresholds (score, capacity, rank, proximity). They can be up to four times faster without loss in quality.
I want to rank 1000 documents, but highly relevant documents are few, so the distribution is skewed towards the irrelevant ones: if I can stop the forest traversal early for the irrelevant documents I gain a lot of time. Also, users look only at the first few pages, so the ranking needs to be good for the top results rather than for the bottom ones.
So, they insert "sentinels" after a certain number of trees, and check whether it is worth continuing or exiting early with the partial score.
All the thresholding techniques accumulate, on the training set, information about the distribution of relevant vs irrelevant documents. E.g., what score discriminates relevant from irrelevant documents on the training set? That score is used as the threshold.
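A minimal sketch of score-based early exit with sentinels, assuming a list of trees with a predict method and pre-computed per-sentinel score thresholds (the sentinel positions and threshold values are illustrative assumptions).

```python
def score_with_early_exit(trees, x, sentinels):
    """sentinels: dict mapping a tree index to the minimum partial score required
    to keep traversing (learned from the training-set score distribution)."""
    partial = 0.0
    for i, tree in enumerate(trees):
        partial += tree.predict(x)
        # At a sentinel, give up on documents that already look clearly irrelevant.
        if i in sentinels and partial < sentinels[i]:
            return partial          # early exit with the partial score
    return partial                  # full traversal for promising documents
```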
Joint Optimization of Cascade Ranking Models [GCBC19]
So far, the techniques optimize the stage-2, precision-oriented ranker. Here the ranking architecture is trained end-to-end, training both stages.
Gallagher et al., 2019 observe that there is a cost-aware literature for LtR but no generalization to cascades. They propose a novel method for learning a globally optimized cascade architecture, using backpropagation end-to-end.
They use three cascade types: independent chaining, full chaining and weak chaining.
Globally learning the cascade achieves much better trade-offs between efficiency and effectiveness than previous approaches.
Efficient Traversal of Tree-based Models
VPred [Asadi et al., 2014] goes from control dependencies to data dependencies: the output of a node test is used as the index to retrieve the next node to process, and the visit is statically unrolled.
We lose the ifs, so the branch predictor is no longer a problem, but caching is still an issue.
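A minimal sketch of the control-to-data-dependency idea, assuming each tree is stored as flat arrays (feature ids, thresholds, child indices, leaf outputs) and leaves point to themselves; the exact layout is an illustrative assumption.

```python
def score_tree(x, feature_ids, thresholds, left_child, right_child, outputs, depth):
    """Branchless traversal: the comparison outcome is used as an index, not as an if.
    Leaves are padded to point to themselves, so the loop can always run `depth` steps."""
    node = 0
    for _ in range(depth):                   # statically unrolled in the real implementation
        go_right = int(x[feature_ids[node]] > thresholds[node])
        # Data dependency: the next node is fetched by indexing with the test outcome.
        node = (left_child[node], right_child[node])[go_right]
    return outputs[node]
```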
QuickScorer [LNO+15], Lucchese et al., 2015. Given a document and a query, each node of a tree can be classified as True or False (depending on whether its test is satisfied).
The exit leaf can be identified by knowing all, and only, the false nodes of a tree.
From per-tree scoring to per-feature scoring: a per-feature linear scan of the thresholds in the forest.
For each feature we take a vector containing the thresholds that test that feature in each tree of the forest.
Then, for each document, we evaluate the feature against its threshold vector, identifying which node tests are True and which are False.
Doing this for all the features we obtain all the false nodes of all the trees in the forest, and from those the exit leaf of each tree.
How? For each node they encode a bitmask that has a 0 on the bits representing the leaves that become unreachable if that node is false (and 1 elsewhere).
ANDing the masks of the false nodes leads to the identification of the exit leaf (the leftmost bit set to 1 in the resulting mask).
This operation is insensitive to node processing order.
The scan is sequential and read-only, so it is cache friendly.
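A minimal sketch of the bitvector trick for a single tree, using Python integers as bitvectors with the leftmost leaf in the most significant bit; the data layout is an illustrative assumption (the real implementation scans sorted per-feature threshold arrays over the whole forest).

```python
def exit_leaf(false_node_masks, num_leaves):
    """false_node_masks: one bitvector per *false* node, with 0s on the leaves that
    become unreachable when that node's test fails (1s elsewhere)."""
    mask = (1 << num_leaves) - 1              # start with all leaves reachable
    for m in false_node_masks:
        mask &= m                             # AND is insensitive to node order
    # The exit leaf is the leftmost bit still set to 1 (0 = leftmost leaf).
    return num_leaves - mask.bit_length()

# Tiny example with 4 leaves: one false node removes leaves 0-1 (mask 0b0011),
# another removes leaf 3 (mask 0b1110); the surviving exit leaf is leaf 2.
print(exit_leaf([0b0011, 0b1110], num_leaves=4))   # -> 2
```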
Vectorized QuickScorer parallelizes the ANDing part using SIMD instructions on the CPU.
Multi-thread QuickScorer and GPU-QuickScorer do the same on multiple threads and on the GPU, with much greater speed-ups.
QuickScorer on FPGA
Neural Approaches for IR
Representation based approach
Each document and query is separately represented as a dense vector through a sequence of neural computations (offline for documents, online for the query). The final ranking is based on the similarity of the two representation vectors, computed with the dot product.
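A minimal sketch of representation-based scoring with numpy, assuming the document vectors have already been computed offline and stored (the file name and the shapes are placeholders), while the query vector arrives online.

```python
import numpy as np

# Offline: the whole collection is encoded once into a matrix (n_docs x dim).
doc_matrix = np.load("doc_embeddings.npy")          # placeholder for the pre-computed vectors

def rank(query_vector: np.ndarray, k: int = 10):
    """Online: one dot product per document, then take the top-k."""
    scores = doc_matrix @ query_vector               # similarity = dot product
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))
```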
Interaction based methods
The word- or term-level similarity of a query and a document is computed online first, based on their embedding vectors.
The final ranking is obtained by an additional sequence of neural computations.
MonoBERT
Query and document are jointly cross-encoded, and the model output is tuned to provide a ranking score.
Input: [CLS] q [SEP] d [SEP]
Output: score
The contextual embedding of the [CLS] token is used to produce the relevance score of the query-document pair, which is directly used for ranking.
Fine-tuning is done with point-wise training: the model estimates a score for each query-document pair and documents are ranked by it.
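A minimal sketch of cross-encoder scoring with the Hugging Face transformers API; the checkpoint name is a placeholder (not a fine-tuned monoBERT), and using a single-label sequence-classification head on top of [CLS] is an assumption about how the score is produced.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")            # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

def score(query: str, document: str) -> float:
    # Builds the [CLS] q [SEP] d [SEP] input described above.
    inputs = tokenizer(query, document, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits        # head on top of the [CLS] embedding
    return logits.squeeze().item()             # relevance score used for ranking
```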
DuoBERT
Training is done with a pairwise loss that estimates whether document d_i is more relevant than d_j.
The loss aggregates the pairwise probabilities p_i,j over positive/negative document pairs (a hedged sketch is given below, after the input/output description).
Input: [CLS] q [SEP] d_i [SEP] d_j [SEP]
Output (from the [CLS] embedding): p_i,j = P(d_i > d_j | d_i, d_j, q)
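A sketch of the pairwise loss, written as commonly reported for duoBERT; the set names J_pos and J_neg (relevant and non-relevant candidates for the query) are notation assumed by this write-up.

$$
\mathcal{L}_{\text{duo}} \;=\; -\sum_{i \in J_{\text{pos}},\, j \in J_{\text{neg}}} \log p_{i,j} \;-\; \sum_{i \in J_{\text{neg}},\, j \in J_{\text{pos}}} \log\bigl(1 - p_{i,j}\bigr)
$$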
All pairs of documents need to be evaluated, so we have a quadratic computational cost.
At inference time, the score of a document is an aggregation of its pairwise probabilities p_i,j.
Representation-Based methods
The ranking is done by a similarity (dot product) between the dense vector representing the document and the one representing the query.
Single representation: one embedding for each document (ANCE).
Multiple representation: one embedding for each term of the document (ColBERT).
How are single-representation models trained?
Pairs or triples of data per input sample (q, d-, d+)
Define a metric to measure similarity between embeddings, so that the two inputs get close embeddings when they are similar and distant embeddings when they are dissimilar.
Loss (a sketch is given after this list):
- Distance between q and d+ should be 0
- Distance between q and d- should be > m
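A sketch of a margin-based (triplet-style) loss consistent with the two conditions above; the exact form varies across papers, so this is an assumption rather than a specific paper's loss.

$$
\mathcal{L}(q, d^+, d^-) \;=\; \operatorname{dist}\!\left(q, d^+\right) \;+\; \max\!\bigl(0,\; m - \operatorname{dist}\!\left(q, d^-\right)\bigr)
$$

where dist is the distance between embeddings and m is the margin.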
Positive docs are known from supervision, but how do you choose the negative docs?
You could use clearly negative documents, but hard negatives (documents that seem relevant but are not) are the ones you want to learn from; easy negatives are easy to spot.
You can select negatives in-batch or from other batches, but this is limiting; you would rather draw hard negatives.
Static sampling: hard negatives are pre-computed before training and never change. The problem: if they are selected using an inverted index, during training the embedding space shifts and what was initially a hard negative can become an easy one.
Dynamic sampling: hard negatives are the current top-ranked irrelevant documents produced by the dense model under training, recomputed as training proceeds.
We have a trainer and an inferencer.
The inferencer uses the checkpoint at iteration k-1 to apply the model to the whole collection, index it, and query it. It thus provides the trainer with the most similar negative documents, which the trainer uses to continue training.
Retrieval is then done using (approximate) kNN to obtain the k nearest neighbours (top-k documents).
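A minimal sketch of the trainer/inferencer loop for dynamic hard-negative mining; all helper names and object attributes (encode_collection, build_index, train_steps, index.search, q.relevant_docs, ...) are illustrative assumptions, not an actual ANCE implementation.

```python
def train_with_dynamic_negatives(model, collection, train_queries, num_iterations,
                                 encode_collection, build_index, train_steps, k=10):
    for it in range(num_iterations):
        # Inferencer: use the previous checkpoint to (re-)encode and index the collection.
        doc_embeddings = encode_collection(model, collection)
        index = build_index(doc_embeddings)

        # Mine hard negatives: top-ranked documents that are not labeled as relevant.
        hard_negatives = {}
        for q in train_queries:
            top_k = index.search(model.encode_query(q.text), k)
            hard_negatives[q.id] = [d for d in top_k if d not in q.relevant_docs]

        # Trainer: continue training on (q, d+, d-) triples built from these negatives.
        model = train_steps(model, train_queries, hard_negatives)
    return model
```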
ColBERT: Contextualized Late Interaction over BERT
Late interaction is used as a way to fight the computational burden.
Two encoders: query and document encoder;
Term-based representation of documents: each document is a set of vectors pre-computed offline, while the query representations are computed online.
They compute a similarity between Q and D, and use approximate kNN for retrieval.
For each token of the query, I take the maximum dot product over all tokens of the document; the sum of these maxima over the query tokens is the relevance score (MaxSim).
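A minimal sketch of the MaxSim scoring described above with numpy, assuming the token embedding matrices are already computed and L2-normalized per token (as in ColBERT).

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """query_embs: (n_query_tokens, dim); doc_embs: (n_doc_tokens, dim).

    For each query token, take the maximum dot product against all document
    tokens, then sum over the query tokens."""
    sim = query_embs @ doc_embs.T           # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())     # max over doc tokens, sum over query tokens
```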