Skip to content

ANN Indexing

HNSW: Vector search for High Dimensional datasets

Approximate Nearest Neighbor (ANN) search is a method for finding data points near a given point in a dataset, though not always the exact nearest one. HNSW is one of the most accurate and fastest Approximate Nearest Neighbour search algorithms, It’s beneficial in high-dimensional spaces where finding the same nearest neighbor would be too slow and costly.

There are three main types of ANN search algorithms:

  1. Tree-based search algorithms: Use a tree structure to organize and store data points.
  2. Hash-based search algorithms: Use a hash table to store and manage data points.
  3. Graph-based search algorithms: Use a graph structure to store data points, which can be a bit complex. HNSW is a graph-based algorithm, that will be broken down into smaller parts to make it easier to understand how it works.

All graph-based search algorithms rely on the idea of a proximity graph, where the graph is built based on the proximity of data points, measured by their Euclidean distances. Jumping straight to HNSW might be complicated, so first, we'll explain two important algorithms that help understand HNSW: the Skip List and Navigable Small World (NSW) Graphs, which are the predecessors of HNSW.

IVFPQ: Accelerate vector search by creating indices

Vector similarity search is finding similar vectors from a list of given vectors in a particular embedding space. It plays a vital role in various fields and applications because it efficiently retrieves relevant information from large datasets.

Vector similarity search requires excessive memory resources for efficient search, especially when dealing with dense vector datasets. Here comes the role of compressing High Dimensional vectors for optimizing memory storage. In this blog, We’ll discuss about

  1. Product Quantization(PQ) & How it works
  2. Inverted File Product Quantization(IVFPQ) Index
  3. Implementation of IVFPQ using LanceDB

We’ll also see the performance of PQ and IVFPQ in terms of memory and cover an implementation of the IVFPQ Index using LanceDB.

Quantization is a process used for dimensional reduction without losing important information.