While graph-based methods like HNSW construct sophisticated neighborhood graphs for searching, the Inverted File Index (IVF) approach takes a different route, borrowing principles from traditional information retrieval and clustering. IVF partitions the entire vector space into distinct regions, or cells, and focuses the search effort only on a few relevant regions, dramatically reducing the number of distance calculations needed.
At its core, IVF relies on clustering. During the index construction phase:
- A representative subset of the database vectors is often sampled.
- A clustering algorithm, typically k-means, is run on this sample (or the full dataset) to identify k cluster centroids. These centroids act as representatives for different regions of the vector space.
- The entire dataset is then scanned. Each vector v is assigned to its nearest centroid ci. This creates k "inverted lists" (or buckets, cells), where each list Li contains the identifiers (and potentially the vectors themselves) of all data points closest to centroid ci.
Figure: Structure of an IVF index. Database vectors (circles) are assigned to their nearest centroid (diamonds, C1-C3), forming inverted lists (L1-L3). A query vector (star) is first compared to the centroids; then only the lists corresponding to the nearest centroids (e.g., L2 if nprobe=1) are searched.
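To make the construction phase concrete, here is a minimal sketch using NumPy and scikit-learn; the random data, dimensionality, and choice of k are illustrative stand-ins, not recommended values.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
xb = rng.standard_normal((10_000, 64)).astype(np.float32)  # database vectors

k = 100  # number of centroids / inverted lists (illustrative)
kmeans = KMeans(n_clusters=k, n_init=1, random_state=0).fit(xb)
centroids = kmeans.cluster_centers_

# Assign each vector to its nearest centroid, forming the k inverted lists.
assignments = kmeans.labels_
inverted_lists = [np.flatnonzero(assignments == i) for i in range(k)]
```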
When a query vector q arrives, the search proceeds as follows:
- The query vector q is compared against the k centroids.
- The nprobe nearest centroids to q are identified. nprobe is a crucial tuning parameter, typically a small integer (e.g., 1, 5, 10).
- The search is restricted to only the vectors contained in the inverted lists corresponding to these nprobe centroids.
- Within these selected lists, a more precise distance calculation (or even an exact k-NN search) is performed between the query q and the vectors in those lists to find the final approximate nearest neighbors.
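Continuing the sketch above, the search procedure maps directly onto a few lines; ivf_search and its default parameters are illustrative:

```python
def ivf_search(q, nprobe=5, topk=10):
    # Steps 1-2: rank all k centroids by distance to the query, keep nprobe.
    probe = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    # Steps 3-4: exact scan restricted to the selected inverted lists.
    cand = np.concatenate([inverted_lists[i] for i in probe])
    dist = np.linalg.norm(xb[cand] - q, axis=1)
    order = np.argsort(dist)[:topk]
    return cand[order], dist[order]
```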
The effectiveness of IVF hinges on the quality of the clustering and the choice of nprobe. If the centroids partition the space well, searching only a few lists significantly prunes the search space. However, increasing nprobe improves the chances of finding the true nearest neighbors (higher recall) at the cost of search time, since more lists must be scanned.
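This trade-off is easy to observe empirically. The toy sweep below reuses ivf_search from the sketch above and compares it against an exact scan on one random query (recall@10 should rise with nprobe):

```python
q = rng.standard_normal(64).astype(np.float32)
exact = np.argsort(np.linalg.norm(xb - q, axis=1))[:10]  # ground-truth top-10
for nprobe in (1, 5, 10, 20):
    ids, _ = ivf_search(q, nprobe=nprobe)
    print(nprobe, np.intersect1d(ids, exact).size / 10)  # recall@10
```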
IVF Variations: Addressing Memory and Speed
The basic IVF approach, often called IVFFlat, stores the full, original vectors within each inverted list. While simple, it can still be memory-intensive if the vectors are high-dimensional, and the search within the selected lists can be slow if those lists are large. To mitigate these issues, IVF is often combined with vector compression techniques, primarily Product Quantization (PQ), which we explore in detail later in this chapter.
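A back-of-the-envelope comparison makes the memory pressure concrete (sizes are illustrative, and bookkeeping such as vector IDs and centroids is ignored):

```python
n, d = 1_000_000, 768           # one million 768-dimensional vectors
raw_bytes = n * d * 4           # IVFFlat stores full float32 vectors
pq_bytes = n * 16               # IVFPQ might store 16-byte PQ codes instead
print(f"raw: ~{raw_bytes / 2**30:.1f} GiB, PQ codes: ~{pq_bytes / 2**20:.0f} MiB")
```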
IVFPQ: Combining IVF with Product Quantization
The most common and impactful variation is IVFPQ. Here's how it modifies the basic IVF structure:
- Index Construction: After assigning vectors to their nearest centroids (like IVFFlat), the vectors within each list are compressed using Product Quantization. Instead of storing the full vectors, only their compact PQ codes are stored in the inverted lists. The original vectors might be stored elsewhere if needed for re-ranking or retrieval, but the index itself only holds the codes.
- Search:
- The query q still finds the nprobe nearest centroids based on distances to the original centroid vectors.
- When searching within the selected lists Li, the distance calculations are performed between the original query vector q and the compressed PQ codes of the database vectors stored in those lists. This is known as Asymmetric Distance Computation (ADC). It's "asymmetric" because we compare a full vector (the query) to coded vectors (database items).
- ADC using PQ codes is significantly faster than calculating distances between full vectors and requires much less memory to store the index.
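The mechanics of ADC are easy to see in a sketch. The codebooks and codes below are random stand-ins for a trained product quantizer (PQ training itself is covered later in this chapter); what matters is the structure: one small lookup table per subquantizer, then a table lookup per code byte instead of a full-vector distance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, ksub = 64, 8, 256          # dim, subquantizers, entries per sub-codebook
dsub = d // m

# Stand-ins for a trained PQ: m sub-codebooks and one m-byte code per vector.
codebooks = rng.standard_normal((m, ksub, dsub)).astype(np.float32)
codes = rng.integers(0, ksub, size=(10_000, m), dtype=np.uint8)

def adc_distances(q):
    # Precompute m lookup tables: squared distance from each query sub-vector
    # to every entry of the corresponding sub-codebook (an m x ksub array).
    qs = q.reshape(m, dsub)
    tables = ((codebooks - qs[:, None, :]) ** 2).sum(axis=2)
    # Estimated distance to a database vector = sum of m table lookups.
    return tables[np.arange(m), codes].sum(axis=1)
```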
The benefits of IVFPQ are substantial:
- Reduced Memory Footprint: PQ dramatically shrinks the size of the vectors stored in the index (e.g., from hundreds of floats to tens of bytes).
- Faster Search within Lists: Calculating distances using PQ codes (ADC) is computationally cheaper than using original vectors.
However, this comes at the cost of accuracy. Quantization introduces approximation errors. The distances calculated using ADC are estimates of the true distances. This means IVFPQ might miss some true nearest neighbors that IVFFlat would have found within the probed lists.
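One common mitigation, hinted at above, is re-ranking: use ADC to produce a generous shortlist, then re-score only those candidates with the original vectors kept outside the index. A sketch, reusing q and xb from the earlier snippets and adc_distances from the one above (with its stand-in codes this only illustrates the control flow, not real accuracy):

```python
shortlist = np.argsort(adc_distances(q))[:100]       # cheap, approximate
exact_d = np.linalg.norm(xb[shortlist] - q, axis=1)  # precise, on few vectors
top10 = shortlist[np.argsort(exact_d)[:10]]          # re-ranked final result
```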
Choosing Between IVFFlat and IVFPQ
- IVFFlat: Preferable when memory is less constrained and higher accuracy (within the probed cells) is desired. Simpler to implement if PQ is not already part of the pipeline. The bottleneck remains the number of full-vector distance calculations within the probed lists.
- IVFPQ: Essential for very large datasets where memory usage is a primary concern. Offers significantly faster search within the probed cells due to ADC. The main trade-off is the loss in accuracy introduced by PQ compression. Tuning the PQ parameters (number of subquantizers, bits per subquantizer) becomes important alongside k and nprobe.
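In practice these structures are rarely implemented from scratch; libraries such as Faiss ship both variants as IndexIVFFlat and IndexIVFPQ. A minimal sketch building them side by side (dataset and parameters are illustrative):

```python
import faiss
import numpy as np

rng = np.random.default_rng(0)
d, nlist = 128, 1024
xb = rng.standard_normal((100_000, d)).astype(np.float32)

# IVFFlat keeps raw vectors in the lists; IVFPQ uses 16 subquantizers x 8 bits,
# i.e. 16-byte codes per vector. Each index gets its own coarse quantizer.
ivf_flat = faiss.IndexIVFFlat(faiss.IndexFlatL2(d), d, nlist)
ivf_pq = faiss.IndexIVFPQ(faiss.IndexFlatL2(d), d, nlist, 16, 8)

for index in (ivf_flat, ivf_pq):
    index.train(xb)                  # k-means (and, for IVFPQ, PQ) training
    index.add(xb)
    index.nprobe = 16                # probe 16 of the 1024 cells per query
    D, I = index.search(xb[:5], 10)  # top-10 neighbors for 5 queries
```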
Other Considerations
- IVFADC: Often used interchangeably with IVFPQ, specifically highlighting the Asymmetric Distance Computation step during the search within lists. Less common is Symmetric Distance Computation (SDC), where the query vector is also quantized before comparison, which is even faster but typically less accurate than ADC.
- Training: Both IVFFlat and IVFPQ require a training phase involving k-means clustering to find the initial centroids. This training needs a representative dataset and can take time for large k or large datasets.
- Parameter Tuning: Selecting the optimal number of centroids (k) and the number of probes (nprobe) is critical.
- A small k leads to fewer, larger lists. Searching may be fast if nprobe is small, but the lists can be too large for efficient internal search.
- A large k leads to many small lists. Finding the nearest centroids takes longer, but searching within the selected lists is faster; a larger nprobe is typically needed to maintain recall.
nprobe directly controls the trade-off between search speed and recall.
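A crude cost model captures this interplay. Assuming uniform list sizes, one query costs roughly k centroid comparisons plus nprobe * n / k in-list comparisons; the model below ignores constants and ADC speedups and is only meant to show the shape of the trade-off:

```python
n = 1_000_000
for k in (256, 1024, 4096, 16384):
    for nprobe in (1, 8, 32):
        cost = k + nprobe * n / k  # centroid scan + probed-list scans
        print(f"k={k:6d}  nprobe={nprobe:2d}  ~{cost:9.0f} distance computations")
```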
IVF variations, especially IVFPQ, represent a powerful and widely used family of ANN algorithms. They offer a different set of trade-offs compared to graph-based methods like HNSW, particularly excelling in memory-constrained environments and offering tunable performance through parameters like k and nprobe. Understanding how they partition the space and optionally compress vectors is fundamental to building large-scale vector search systems.