An information retrieval (IR) product is (usually) a search engine.
A Search Engine is built around a giant document collection that we want to rank with respect to a user query, returning the most relevant documents to the user.
The Document Collection gets cleaned, tokenized and indexed. The objective is to respond to the query as fast as possible.
The “classic” Search Engine follows this pipeline: the document collection is cleaned, tokenized and indexed, and queries are evaluated against the index.
Nowadays additional steps are taken: the collection goes through a Feature Processor, creating a Document Features Repository that can be looked up during query processing. Also, by using users’ reactions to previous results, we can train a Ranking Function.
Query Processing
After pre-processing we have, for each document, a list of cleaned terms, ordered as they appear in the text.
Boolean Retrieval
A classical IR model, where documents are represented as sets of terms. We lose the order in which the terms appear, and we lose their frequencies.
Retrieval is based on whether or not the documents contain the query terms and whether they satisfy the boolean conditions (AND, OR, NOT) expressed by the query.
Good for expert users with a precise understanding of their needs and a deep knowledge of their own collection. However, boolean queries often result in either too few (= 0) or too many results: too many ANDs give 0 results, and vice versa for OR.
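As an illustration (not from the notes), here is a minimal sketch of boolean retrieval over a toy collection, with each document stored as a Python set of terms; the documents and queries are invented:

```python
# Minimal sketch of boolean retrieval: each document is just a set of terms.
docs = {
    1: {"information", "retrieval", "search"},
    2: {"boolean", "retrieval", "model"},
    3: {"ranked", "retrieval", "search", "engine"},
}

def boolean_and(*terms):
    """Documents containing ALL the given terms."""
    return {doc_id for doc_id, terms_in_doc in docs.items()
            if all(t in terms_in_doc for t in terms)}

def boolean_or(*terms):
    """Documents containing AT LEAST ONE of the given terms."""
    return {doc_id for doc_id, terms_in_doc in docs.items()
            if any(t in terms_in_doc for t in terms)}

print(boolean_and("retrieval", "search"))  # {1, 3}
print(boolean_or("boolean", "ranked"))     # {2, 3}
```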
Ranked Retrieval
Instead of representing documents as sets, we represent them as multi-sets, so we can return the documents by relevance, ranking them w.r.t. the query.
How do we order the documents w.r.t. the query? We use a score that is relative to the query.
Bag of Words (BOW) Model
Consider the number of occurrences of a term in a document:
- Each document is a count vector in $\mathbb{N}^{|V|}$, where $V$ is the vocabulary and $|V|$ the number of distinct terms in the collection. So we have a giant vector where, for each term of the vocabulary, we store that term's frequency in the document.
- Each count vector is a bag of words representing a document.
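A small sketch of the BOW representation, assuming a toy collection and simple whitespace tokenization (both invented for illustration):

```python
from collections import Counter

docs = ["to be or not to be", "to do is to be", "to be is to do"]
tokenized = [d.split() for d in docs]

# Vocabulary: the set of distinct terms in the collection.
vocab = sorted({t for doc in tokenized for t in doc})

# Each document becomes a |V|-dimensional count vector (its bag of words).
def bow_vector(tokens):
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

for doc, tokens in zip(docs, tokenized):
    print(doc, "->", bow_vector(tokens))
```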
Term frequency $\mathrm{tf}_{t,d}$. The term frequency of term $t$ in document $d$ is defined as the number of times that $t$ occurs in $d$. Raw term frequency is not usually what we want, because relevance does not increase proportionally with term frequency. To smooth the function, we apply a logarithm:
$$w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
Document frequency $\mathrm{df}_t$. For each term $t$ we also count how many documents contain that term. In this way (through the inverse below), words that are frequent everywhere get a low weight.
Inverse Document Frequency:
$$\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}$$
with:
- $D$: the set of all documents in the corpus;
- $N = |D|$: the total number of documents in the corpus.
So, the TF-IDF score of a document $d$ for a query $q$ is:
$$\mathrm{score}(q, d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} \frac{N}{\mathrm{df}_t}$$
So, each document is represented by a $|V|$-dimensional vector, where the vocabulary terms are the axes of the space and documents are points in this space. The values of the vector are the TF-IDF weights of the terms in the document.
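A minimal sketch of TF-IDF scoring following the formulas above (toy documents, base-10 logarithms and whitespace tokenization are assumptions for illustration):

```python
import math
from collections import Counter

docs = {
    1: "information retrieval is the task of finding information".split(),
    2: "a search engine ranks documents for a query".split(),
    3: "boolean retrieval returns unranked documents".split(),
}
N = len(docs)

# Document frequency: number of documents containing each term.
df = Counter()
for tokens in docs.values():
    df.update(set(tokens))

def tf_idf_score(query, doc_id):
    """score(q, d) = sum over t in q of (1 + log10 tf) * log10(N / df_t)."""
    tf = Counter(docs[doc_id])
    score = 0.0
    for t in query:
        if tf[t] > 0:
            score += (1 + math.log10(tf[t])) * math.log10(N / df[t])
    return score

query = "information retrieval".split()
print(sorted(docs, key=lambda d: tf_idf_score(query, d), reverse=True))  # [1, 3, 2]
```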
Best Match 25 (BM25), also known as Okapi BM25, is a strong and popular baseline:
$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{idf}_t \cdot \frac{\mathrm{tf}_{t,d} \cdot (k_1 + 1)}{\mathrm{tf}_{t,d} + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$
where:
- $\mathrm{tf}_{t,d}$ is the number of times the keyword $t$ occurs in the document $d$;
- $|d|$ is the length of the document in words;
- avgdl is the average length of the documents in the collection;
- $b$ and $k_1$ are hyper-parameters (they should be tuned on the collection): $b \in [0, 1]$, typically 0.75, and $k_1 \in [1.2, 2.0]$.
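A sketch of BM25 as written above; the toy collection and the plain $\log_{10}(N/\mathrm{df}_t)$ idf are assumptions on my part (real systems often use a smoothed idf variant):

```python
import math
from collections import Counter

docs = {
    1: "information retrieval is the task of finding relevant information".split(),
    2: "a search engine ranks documents of a collection for a user query".split(),
    3: "boolean retrieval returns an unranked set of documents".split(),
}
N = len(docs)
avgdl = sum(len(tokens) for tokens in docs.values()) / N

df = Counter()
for tokens in docs.values():
    df.update(set(tokens))

def bm25(query, doc_id, k1=1.2, b=0.75):
    tf = Counter(docs[doc_id])
    dl = len(docs[doc_id])                   # document length in words
    score = 0.0
    for t in query:
        if tf[t] == 0:
            continue
        idf = math.log10(N / df[t])          # simple idf; other variants exist
        norm = tf[t] + k1 * (1 - b + b * dl / avgdl)
        score += idf * tf[t] * (k1 + 1) / norm
    return score

query = "relevant documents".split()
print(sorted(docs, key=lambda d: bm25(query, d), reverse=True))
```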
Query Evaluation
From a collection of billions of documents we get a matrix of size $|D| \times |V|$. A query, which is typically much smaller than a document, is transformed into a vector of size $|V|$ with value 1 (if not weighted) for the terms that appear in the query. We obtain the score of each document by taking the dot product of the query vector with that document's row.
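A tiny NumPy sketch of this dense evaluation (the matrix is invented and far smaller than a real $|D| \times |V|$ one, which could never be materialized):

```python
import numpy as np

# Toy term-document weight matrix: 4 documents x 5 vocabulary terms
# (rows = documents, columns = terms; values could be tf-idf weights).
W = np.array([
    [0.0, 1.2, 0.0, 0.5, 0.0],
    [0.3, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.4, 1.1, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.2, 0.0],
])

# Unweighted query vector: 1 for the terms that appear in the query.
q = np.array([0, 1, 0, 1, 0])

scores = W @ q               # one dot product per document
print(scores)                # [1.7 0.  0.4 0.2]
print(np.argsort(-scores))   # documents ranked by score
```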
This is, however, both space inefficient (though this is mitigated by the fact that the matrix is sparse) and time inefficient.
So, we use an Inverted Index.
Inverted Indices for Boolean Retrieval
For each term in the dictionary, we have a posting list that contains the docIDs of the documents that contain that term. The lists are stored sorted by docID, for both space and time efficiency.
Inverted Indices for BM25
For each document in the posting list we also store the term frequency in that document.
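A minimal sketch of building such an inverted index with per-document term frequencies (toy documents; the dictionary layout is heavily simplified):

```python
from collections import Counter, defaultdict

docs = {
    1: "search engines use an inverted index".split(),
    2: "an index maps terms to posting lists".split(),
    3: "posting lists store docids and term frequencies".split(),
}

# term -> posting list of (docID, term frequency) pairs, sorted by docID
index = defaultdict(list)
for doc_id in sorted(docs):          # insert in increasing docID order
    for term, freq in Counter(docs[doc_id]).items():
        index[term].append((doc_id, freq))

print(index["index"])    # [(1, 1), (2, 1)]
print(index["posting"])  # [(2, 1), (3, 1)]
```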
We’ll only cover exact query search, but modern search engines do query expansion: they enrich the query with synonyms, singular/plural variants of the terms, etc.
TAAT vs DAAT
Term-at-a-time (TAAT):
- Reads posting lists one query term at a time;
- maintains an accumulator, holding a value/score, for each result document. Accumulators are usually stored in a hash table;
- updates the score of every document in the current term's posting list;
It’s easy to implement, but it generates a lot of cache misses because of the many accumulators (jumps in memory), and there is no possibility of skipping docIDs for selective queries.
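A sketch of TAAT over a toy index; the per-posting contribution here is just the term frequency, standing in for a real scoring function such as BM25:

```python
from collections import defaultdict

# term -> posting list of (docID, term frequency), sorted by docID
index = {
    "search": [(1, 2), (3, 1)],
    "engine": [(1, 1), (2, 1), (3, 2)],
}

def taat(query_terms):
    accumulators = defaultdict(float)      # docID -> partial score (hash table)
    for term in query_terms:               # one posting list at a time
        for doc_id, tf in index.get(term, []):
            accumulators[doc_id] += tf     # placeholder score contribution
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

print(taat(["search", "engine"]))  # [(1, 3), (3, 3), (2, 1)]
```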
Document-at-a-time (DAAT):
- Scans posting lists of all query terms in parallel;
- computes the score of a document when it is seen in one or more posting lists;
- advances many posting lists simultaneously;
- maintains a sorted list of results;
NextGEQ(d): the key posting-list operation here, which advances a list to the first posting with docID ≥ d; it is what makes skipping possible.
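A hedged sketch of DAAT traversal: for brevity it does conjunctive (AND) matching, which is where NextGEQ-based skipping pays off, and the next_geq helper is a simplified stand-in (a real implementation would use skip pointers rather than binary search over an in-memory list):

```python
import bisect

# term -> posting list of docIDs, sorted (term frequencies omitted for brevity)
index = {
    "search": [1, 4, 7, 9],
    "engine": [2, 4, 8, 9],
}

def next_geq(postings, start, target):
    """Position of the first posting at or after `start` with docID >= target."""
    return bisect.bisect_left(postings, target, lo=start)

def daat_and(terms):
    """Document-at-a-time conjunctive matching: docIDs present in every list."""
    lists = [index[t] for t in terms]
    pos = [0] * len(lists)
    results = []
    while all(p < len(pl) for p, pl in zip(pos, lists)):
        candidate = max(pl[p] for p, pl in zip(pos, lists))
        # advance every list to the first docID >= candidate (this is the skip)
        pos = [next_geq(pl, p, candidate) for p, pl in zip(pos, lists)]
        if all(p < len(pl) and pl[p] == candidate for p, pl in zip(pos, lists)):
            results.append(candidate)       # seen in all lists: score it here
            pos = [p + 1 for p in pos]      # move past the matched docID
    return results

print(daat_and(["search", "engine"]))  # [4, 9]
```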
Skip pointers:
- Place skip pointers in the posting lists so that blocks of elements can be skipped. They can be combined with compression by compressing an entire block at a time.
We use a min-heap to keep the $k$ largest values in a sequence.
The threshold (the root of the heap) gets higher and higher, because it is the worst score among the current best $k$.
This strategy costs $O(n \log k)$, where $k$ is the number of top results we want, while sorting everything and extracting the top-$k$ costs $O(n \log n)$.
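A sketch of top-$k$ selection with a min-heap, using Python's heapq (the scores are invented):

```python
import heapq

def top_k(scores, k):
    """Keep the k largest scores seen so far in a min-heap of size k."""
    heap = []                     # heap[0] is the threshold: worst of the best k
    for doc_id, score in scores:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:  # beats the current threshold
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)   # O(n log k) overall instead of O(n log n)

scores = [(1, 0.4), (2, 2.1), (3, 1.3), (4, 0.9), (5, 1.7)]
print(top_k(scores, k=2))  # [(2.1, 2), (1.7, 5)]
```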
Exact top-K retrieval: WAND
MaxScore
Its great advantage w.r.t. WAND is that you don't need to keep re-sorting the posting lists at every step.
It guarantees exact top-$k$ results while skipping many postings. The idea is: given the current threshold $\theta$, split the posting lists into essential and non-essential lists.
The non-essential lists are those whose score upper bounds sum to at most $\theta$, so a document that appears only in them cannot enter the top-$k$. We traverse only the essential lists, summing each candidate's score there, and if its partial score plus the non-essential upper bounds can exceed $\theta$, we look the document up in the non-essential lists as well.
We can store the upper bounds per block instead of per posting list, and have more occasions for skipping.
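A small sketch of just the essential/non-essential split (the term names, upper bounds and threshold are invented): lists are considered in increasing order of upper bound, and the longest prefix whose upper bounds sum to at most $\theta$ is non-essential:

```python
def split_lists(upper_bounds, theta):
    """Split posting lists into (non_essential, essential) given threshold theta.

    A document appearing only in the non-essential lists cannot reach theta,
    because the sum of their upper bounds is <= theta.
    """
    ordered = sorted(upper_bounds.items(), key=lambda kv: kv[1])  # by upper bound
    non_essential, prefix_sum = [], 0.0
    for term, ub in ordered:
        if prefix_sum + ub > theta:
            break
        prefix_sum += ub
        non_essential.append(term)
    essential = [term for term, _ in ordered[len(non_essential):]]
    return non_essential, essential

# per-term score upper bounds (max contribution of that list to any document)
upper_bounds = {"search": 1.0, "engine": 2.5, "fast": 0.7}
print(split_lists(upper_bounds, theta=2.0))  # (['fast', 'search'], ['engine'])
```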
Data compression in Modern Search Engines
The compression ratio is defined as the size of the compressed output divided by the size of the original input.
Problem. We can’t use string-based compression algorithms here; we want to compress integers.
We are given an integer $x > 0$, and we have to design an algorithm that represents $x$ in as few bits as possible.
Around 60% of the values in posting lists are small.
Codewords. The bit string representing $x$ according to the chosen code is denoted $C(x)$. A sequence $S = x_1, x_2, \ldots, x_n$ of integers is coded as the concatenation of the codewords assigned to its elements, i.e. $C(S) = C(x_1) C(x_2) \cdots C(x_n)$.
These are static codes, because they always assign the same codeword to an integer $x$, regardless of the sequence to be coded.
Binary strings of fixed length: $\lceil \log_2 U \rceil$ bits suffice to represent any number in $[0, U)$. $\mathrm{bin}(x)$ is the binary representation of $x$, and the number of bits necessary to represent $x$ is $|\mathrm{bin}(x)| = \lfloor \log_2 x \rfloor + 1$. Spending only $|\mathrm{bin}(x)|$ bits per integer would be optimal, but it is ambiguous to decode, so we need a code $C(x)$ whose length is as close as possible to it. With fixed-length codes we know where each number terminates, but we want variable-length codes, which are space efficient, so we need a code that is prefix-free, i.e. no integer has a codeword that is a prefix of another codeword.
Unary Code
Idea. The 1 bits are the data, a 0 bit is the delimiter: $U(x) = \underbrace{1 \cdots 1}_{x-1} 0$, so $|U(x)| = x$ bits.
The code is good only for small integers.
Elias Gamma
Idea. Make $\mathrm{bin}(x)$ decodable by prepending a code for its length. You represent $x$ with $\mathrm{bin}(x)$, but prepend to it, in unary, the length of $\mathrm{bin}(x)$, so we prepend $U(|\mathrm{bin}(x)|)$. We can also remove the first 1 of $\mathrm{bin}(x)$, because it is always 1 and therefore superfluous.
I.e. $\gamma(x) = U(|\mathrm{bin}(x)|) \cdot \mathrm{bin}(x)$ without its leading 1, where:
- $U(|\mathrm{bin}(x)|)$ is the unary code of the length of $x$ in binary;
- the second part is the binary representation of $x$ without the left-most 1.
For example, $\gamma(6) = U(3) \cdot 10 = 110 \cdot 10 = 11010$.
Elias Delta
Idea. Same as before, but we encode the length $|\mathrm{bin}(x)|$ with Elias Gamma: $\delta(x) = \gamma(|\mathrm{bin}(x)|) \cdot \mathrm{bin}(x)$ without its leading 1.
So, instead of representing the length in unary, we represent it with Gamma. This performs better for larger integers.
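A sketch of the three codes as bit strings, under the conventions used above (unary of $n$ is $n-1$ ones followed by a zero):

```python
def unary(n):
    """U(n): n-1 one bits followed by a single zero bit (so |U(n)| = n)."""
    return "1" * (n - 1) + "0"

def gamma(x):
    """Elias gamma: unary(|bin(x)|) followed by bin(x) without its leading 1."""
    b = bin(x)[2:]                 # binary representation of x, no '0b' prefix
    return unary(len(b)) + b[1:]

def delta(x):
    """Elias delta: gamma(|bin(x)|) followed by bin(x) without its leading 1."""
    b = bin(x)[2:]
    return gamma(len(b)) + b[1:]

for x in (1, 2, 6, 9, 16):
    print(x, unary(x), gamma(x), delta(x))
# e.g. gamma(6) = "110" + "10" = "11010"; delta(6) = gamma(3) + "10" = "10110"
```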
Variable Byte Code
Idea. Codewords are byte-aligned rather than bit-aligned.
Integers are represented by a sequence of bytes, where the first bit of each byte is a control bit used to signal continuation/end of the stream of bytes; the other 7 bits carry the integer's binary representation.
For small integers this wastes space.
For example, 6 fits in a single byte: the control bit, then four padding zeros, then $\mathrm{bin}(6) = 110$, i.e. the 7 payload bits are 0000110.
It’s extremely fast in decompression.
It uses almost double the space of the bit-aligned codes above.
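A sketch of variable-byte coding; the convention assumed here is that the control bit is set to 1 on the last byte of each integer and 0 on continuation bytes:

```python
def vb_encode(x):
    """Encode one non-negative integer into bytes with 7 payload bits each."""
    chunks = []
    while True:
        chunks.insert(0, x & 0x7F)   # take the 7 low-order bits
        x >>= 7
        if x == 0:
            break
    chunks[-1] |= 0x80               # control bit: mark the last byte
    return bytes(chunks)

def vb_decode(data):
    """Decode a byte stream produced by vb_encode back into integers."""
    numbers, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:              # last byte of this integer
            numbers.append(current)
            current = 0
    return numbers

encoded = b"".join(vb_encode(x) for x in (6, 127, 128, 300))
print(encoded.hex())        # 6 -> 0x86, i.e. control bit 1 + 0000110
print(vb_decode(encoded))   # [6, 127, 128, 300]
```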