Ranking is the ordering task.
Rank is the position of the object in the ranking.
What we want from a ranking
We want it to be:
- High quality: users should be satisfied with the results;
- Highly efficient: it needs partitioned indexes and low response times;
- Easy to adapt: it needs to keep up with continuous crawling of the web, continuous user feedback, topic drift, etc.;
- Measurable and predictable: tail-latency queries, quality/efficiency trade-off.
Given a query q and a set of documents D, the goal is to rank D so as to maximize the user’s satisfaction.
Maximizing satisfaction could make the process less efficient: efficiency and effectiveness are orthogonal concerns.
Document representations
A document is a multi-set of words; it can have fields or be split into zones. Additional information may be useful: in-links, out-links, PageRank, social links, etc.
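A possible in-memory view of such a representation (a minimal sketch; all field names and values are illustrative, not the schema of any real system):

```python
# Hypothetical representation of one document split into zones,
# with query-independent link-based signals attached.
doc = {
    "doc_id": "d42",
    "zones": {                       # the document as a bag of words per zone
        "title":  ["learning", "to", "rank"],
        "body":   ["ranking", "is", "the", "ordering", "task"],
        "anchor": ["ltr", "tutorial"],
    },
    "signals": {                     # extra per-document information
        "pagerank": 0.013,
        "in_links": 57,
        "out_links": 12,
    },
}
```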
Ranking Functions
- Term weighting;
- Vector space model;
- BM25, language modeling;
- Linear combinations of features.
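As an example of such a ranking function, BM25 can be computed directly from term statistics. A minimal sketch, assuming the usual k1/b parametrisation and a smoothed IDF variant (parameter defaults are illustrative):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document for a query with a classic BM25 formulation."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freqs.get(term, 0)
        # Inverse document frequency (smoothed variant).
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
        # Term-frequency saturation and document-length normalization.
        norm_tf = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm_tf
    return score
```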
Learning to Rank (LtR)
Learning to Rank is the task of automatically constructing a ranking model from training data, such that the model can sort new objects according to their degree of relevance, preference, or importance. This is supervised learning.
The task is structured as follows: given a query q and a set of documents D, for each document d in D we have a label that tells us how relevant d is for q. The model needs to learn the relevance of each document conditioned on the query.
An example is a vector of real-valued features representing a document, with a label that ranges from 0 to 4 (irrelevant to perfectly relevant). The aim is to learn a function that mimics the ideal document ordering induced by the labels of the training instances.
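A minimal sketch of how such training instances might be laid out (feature values, labels, and query ids are made up for illustration):

```python
import numpy as np

# Each row: the feature vector of one (query, document) pair,
# e.g. BM25 of title, BM25 of body, PageRank, ...
X = np.array([
    [12.3, 4.1, 0.02],   # query 1, doc a
    [ 3.7, 0.9, 0.10],   # query 1, doc b
    [ 8.8, 2.2, 0.01],   # query 2, doc c
])
y = np.array([4, 1, 2])       # graded relevance labels in {0, ..., 4}
qid = np.array([1, 1, 2])     # groups documents belonging to the same query

# A learned scoring function f(x) should order the documents of each query
# so that higher-labelled documents come first.
```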
Evaluation Measures for Ranking
With binary relevance labels and precision as the evaluation metric, we only check whether each label is predicted correctly (0–1, relevant or not), but we do not check whether the documents are correctly ranked (relevant ones on top). We also cannot express graded judgements, only 0–1.
So, instead of binary labels we can use graded labels from 0 to 4, and build the evaluation metric by dividing each label by (a function of) its position in the ranking: the higher a highly-labelled document is placed, the higher the score.
The general formula is always of the form
$$\text{Quality@}k = \sum_{i=1}^{k} G(l_i)\cdot D(i)$$
where the left factor $G(l_i)$ is the gain (based on the label value) and the right factor $D(i)$ is the discount (based on the position).
The simplest instance is (N)DCG:
$$\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{l_i} - 1}{\log_2(i + 1)}$$
Here $2^{l_i} - 1$ is the gain and $1/\log_2(i+1)$ is the discount.
(N) = Normalized: NDCG@k is DCG@k divided by the DCG@k of the ideal ordering (documents sorted by decreasing label), so it lies in [0, 1].
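A minimal implementation matching the formula above (exponential gain, log2 discount); it assumes the labels are given in the order produced by the ranker:

```python
import numpy as np

def dcg_at_k(labels, k):
    labels = np.asarray(labels, dtype=float)[:k]
    positions = np.arange(1, len(labels) + 1)
    return np.sum((2 ** labels - 1) / np.log2(positions + 1))

def ndcg_at_k(labels_in_ranked_order, k):
    ideal = sorted(labels_in_ranked_order, reverse=True)   # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(labels_in_ranked_order, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: labels of the documents in the order the ranker returned them.
print(ndcg_at_k([3, 2, 0, 1], k=4))   # 1.0 would mean the ideal ordering
```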
Is it difficult?
We want to learn the ranking, not the labels. This is complex because ranking quality measures are not nicely differentiable, so gradient descent is not directly applicable; moreover, we have huge datasets and hundreds of features.
With respect to the model’s scores, the gradient of a ranking measure is either 0 (the sorted order did not change) or undefined (at a discontinuity, where two documents swap positions).
So, instead of optimizing the true ranking measure directly, we use a proxy loss function that is differentiable and behaves similarly to the original cost function.
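One common choice of differentiable proxy is a pairwise logistic (RankNet-style) loss on score differences; a minimal sketch, with made-up scores and labels:

```python
import numpy as np

def pairwise_logistic_loss(scores, labels):
    """For each pair (i, j) where doc i is labelled more relevant than doc j,
    accumulate log(1 + exp(-(s_i - s_j))), which is smooth in the scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    loss, grad = 0.0, np.zeros_like(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:                  # i should be ranked above j
                diff = scores[i] - scores[j]
                loss += np.log1p(np.exp(-diff))
                g = -1.0 / (1.0 + np.exp(diff))        # d loss / d s_i
                grad[i] += g
                grad[j] -= g                           # d loss / d s_j = -g
    return loss, grad

loss, grad = pairwise_logistic_loss(scores=[0.2, 1.5, 0.3], labels=[3, 1, 0])
```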
Machine Learning for IR on Hand-Crafted Features
The input is a vector of features modeling the relevance of a query/document pair, plus a label. The state-of-the-art approach uses forests of thousands of regression trees.
High quality, but computationally expensive to apply.
Point-Wise Algorithms
Each document is considered independently of the others: no information about the other candidates for the same query is used. A different cost function is optimized depending on the approach: regression, multi-class classification, or ordinal regression. An example is gradient-boosted regression trees (GBRT), where the sum of squared errors (SSE) is minimized.
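A minimal point-wise sketch using scikit-learn’s GradientBoostingRegressor with squared-error loss; the data here is random and purely illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))      # 500 documents, 10 hand-crafted features
y_train = rng.integers(0, 5, size=500)    # graded relevance labels 0..4

# Point-wise: each document is a separate regression target; query structure is ignored.
model = GradientBoostingRegressor(loss="squared_error", n_estimators=200, max_depth=3)
model.fit(X_train, y_train)

# At query time, score each candidate document and sort by predicted relevance.
X_candidates = rng.normal(size=(20, 10))
scores = model.predict(X_candidates)
ranking = np.argsort(-scores)             # candidate indices, best first
```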
(Skipped: the whole GBRT part.)
(Skipped: the whole representation learning part.)
Single-Stage Ranking
It requires applying the learnt model to every matching document and generating the required features. This is not feasible! Three trade-offs to keep in mind:
- Feature Computation Trade-off: Computationally expensive & highly discriminative features vs computationally cheap & slightly discriminative features.
- Number of Matching Candidates Trade-off (size of the candidate set), which leads to multi-stage ranking (first filter for the top-k candidates, then rank them): a large set of candidates is expensive and produces high-quality results vs a small set of candidates is cheap and produces low-quality results. For multi-stage ranking: which model and features, and how many documents, at each stage? Out of 200 tested configurations, the best results seem to use N=3 stages, with 2500 documents passed between stage 1 and 2, and 700 between stage 2 and 3 (see the cascade sketch after this list).
- Model Complexity trade-off: Complex and slow high quality models vs Simple and fast low quality models;
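A minimal sketch of such a cascade, with the stage scorers passed in as plain callables (the function name, toy documents, and score fields are illustrative; the stage sizes follow the 2500/700 configuration mentioned above):

```python
def cascade_rank(candidates, cheap_score, mid_score, expensive_score,
                 k1=2500, k2=700, k_final=10):
    """candidates: list of documents; each *_score maps a document to a float."""
    # Stage 1: cheap scoring (e.g. BM25) over all matching documents.
    stage1 = sorted(candidates, key=cheap_score, reverse=True)[:k1]
    # Stage 2: moderately expensive features/model on the surviving k1 documents.
    stage2 = sorted(stage1, key=mid_score, reverse=True)[:k2]
    # Stage 3: the full, expensive model (e.g. a large tree ensemble) on the top k2.
    stage3 = sorted(stage2, key=expensive_score, reverse=True)
    return stage3[:k_final]

# Toy usage: documents are dicts with precomputed per-stage scores.
docs = [{"id": i, "s1": i % 7, "s2": i % 5, "s3": i % 3} for i in range(10000)]
top = cascade_rank(docs,
                   cheap_score=lambda d: d["s1"],
                   mid_score=lambda d: d["s2"],
                   expensive_score=lambda d: d["s3"])
```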