Fast Semantic Search Brain Dump
Some temporary project notes until I know what to do with this. I’ll turn this into a proper blog post soon.
How to Efficiently Search Vectors
Embedding Model
- Probably something like all-MiniLM-L6-v2
- Load into memory, quantized maybe to int8 → low memory overhead and no cold starts
- Model has 22.7M parameters
- Model size with int8 quantization: ~22.7 MB
That seems very, very good for an in-memory embedding model.
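As a rough sketch of the "load once, keep it in memory" idea (assuming sentence-transformers plus PyTorch dynamic quantization; int8 here just means int8 weights for the Linear layers, not the only way to do it):

```python
# Minimal sketch: load all-MiniLM-L6-v2 once at startup and shrink it with
# PyTorch dynamic quantization (int8 weights for the Linear layers).
# Assumption: sentence-transformers + torch are available and dynamic
# quantization is "good enough" for this use case.
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")          # ~22.7M parameters
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def embed(texts: list[str]):
    # 384-dim embeddings, normalized so cosine similarity == dot product
    return model.encode(texts, normalize_embeddings=True)
```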
pgvector stuff
pgvector uses two primary indexing strategies:
- HNSW (Hierarchical Navigable Small World - Graph Theory and Small World Networks)
- IVFFlat (Inverted File with Flat Compression)
Probably best to benchmark the two methods.
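Something like this could build both index types for the benchmark (assuming a docs table with a 384-dim embedding column and psycopg; the m / ef_construction / lists values are just starting points, not tuned):

```python
# Sketch: create both pgvector index types on the same column so they can be
# benchmarked against each other. Table/column names and parameters are assumptions.
import psycopg

with psycopg.connect("postgresql://localhost/demo") as conn:
    conn.execute("""
        CREATE INDEX IF NOT EXISTS docs_embedding_hnsw
        ON docs USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """)
    conn.execute("""
        CREATE INDEX IF NOT EXISTS docs_embedding_ivfflat
        ON docs USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 100)
    """)
```

In practice it's probably cleaner to build and drop them one at a time, so each benchmark run only has one index for the planner to choose.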
HNSW
Key Idea: Graph-based indexing that relies on “small world networks”, where most nodes can be reached from any other node by traversing a small number of edges. HNSW expands on this by using a hierarchical structure to improve search efficiency.
NSW (Navigable Small World)
- Nodes are connected by edges. Each neighboring node is called a friend
- When querying: start at a random entry point and greedily traverse the graph to find the closest nodes to the query vector
HNSW is like combining probabilistic skip lists with NSW.
Layers (top to bottom)
- Layer L (entry layer): nodes with the longest edges, similar to the top level of a skip list. Each node is more likely to be high-degree because it has the longest edges.
- Layer L−1: nodes with the second-longest edges, similar to skip lists. Each node is more likely to be high-degree because it has the second-longest edges.
- Layer 0: All vectors in dataset, highly connected graph. Each node is connected to its nearest neighbors by Euclidean distance.
Traverse the edges in a layer greedily, calculating the distance between the query vector and each node. Once we hit a local minimum, move down to the next layer.
Each layer gives us a new local minimum, so the search in the next layer starts from an entry point that is the closest node found in the previous layer. Worst case is we hit layer 0.
Higher memory overhead because we need to store the graph structure for each layer (e.g. links, distance, etc.)
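A toy version of the layered greedy descent described above (assuming `layers` is a list of adjacency dicts from the top layer down to layer 0 and plain Euclidean distance; real HNSW also keeps a candidate list per layer, this only keeps the single best node):

```python
# Toy sketch of HNSW-style greedy descent across layers.
import numpy as np

def dist(a, b):
    # Euclidean distance between two vectors
    return float(np.linalg.norm(a - b))

def greedy_search(layers, vectors, query, entry):
    node = entry
    for graph in layers:                      # top layer -> layer 0
        improved = True
        while improved:                       # greedy walk until a local minimum
            improved = False
            for neighbor in graph.get(node, []):
                if dist(vectors[neighbor], query) < dist(vectors[node], query):
                    node = neighbor
                    improved = True
        # the local minimum here becomes the entry point for the next layer down
    return node
```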
IVFFlat
- Uses k-means clustering to divide the vectors into k clusters (aka Voronoi cells)
- “Inverted File”: Each cluster’s centroid maps back to the vectors it contains
- Quantization-based indexing (“buckets” of vectors): suppose a vector is 1536 dimensions; the DB doesn’t need to compare all dimensions, only the quantized/discretized values → cheaper lookups
- Search: embed the query → cosine similarity on the quantized values (e.g. centroids) → cheaper lookups, faster retrieval (see the sketch after this list):
  - Using k-means: lookups are roughly $O(k + n/k)$ (scan the $k$ centroids, then the ~$n/k$ vectors in the nearest cluster)
  - Looking up every vector: $O(n)$
  - Way faster if $1 \ll k \ll n$, which is very likely
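A toy numpy version of the IVFFlat idea, probing a single nearest cluster (real pgvector probes `ivfflat.probes` clusters; the crude k-means here is only for illustration):

```python
# Toy IVF-style index: k-means clusters + an "inverted file" from centroid -> vector ids.
import numpy as np

def build_ivf(vectors, k, iters=10):
    # crude k-means: random init, then alternate assignment / centroid update
    rng = np.random.default_rng(0)
    centroids = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = vectors[assign == c].mean(axis=0)
    lists = {c: np.where(assign == c)[0] for c in range(k)}   # the "inverted file"
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, top=5):
    c = int(np.argmin(np.linalg.norm(centroids - query, axis=1)))  # O(k) centroid scan
    ids = lists[c]                                                 # ~n/k candidates
    d = np.linalg.norm(vectors[ids] - query, axis=1)
    return ids[np.argsort(d)[:top]]
```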
Thoughts:
- k-means is non-parametric, so insertions have a huge memory footprint that scales with the data → good for static data
Side note:
pgvector uses fp32, maybe use the halfvec type for fp16. Compare benchmark storage footprints here?
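A quick way to compare the two storage footprints (assuming pgvector ≥ 0.7, which adds halfvec; table names are made up):

```python
# Sketch: compare on-disk size of fp32 (vector) vs fp16 (halfvec) columns.
import psycopg

with psycopg.connect("postgresql://localhost/demo") as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS docs_fp32 (id bigserial PRIMARY KEY, embedding vector(384))")
    conn.execute("CREATE TABLE IF NOT EXISTS docs_fp16 (id bigserial PRIMARY KEY, embedding halfvec(384))")
    # after loading the same rows into both, compare total relation sizes
    for table in ("docs_fp32", "docs_fp16"):
        size = conn.execute(
            "SELECT pg_size_pretty(pg_total_relation_size(%s))", (table,)
        ).fetchone()[0]
        print(table, size)
```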
Matryoshka Representation Learning (MRL)
- Most semantic info encoded in first few dimensions, becoming more granular as we go deeper (kinda similar to SVD expansions)
- Instead of indexing full dimension of vector, create index of “slice” of vector (sub-vector)
- Index by these sub-vectors → can take the top k vectors so that search is significantly faster (sketch below)
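A minimal sketch of the MRL-style two-pass search (assuming the embedding model was trained with MRL so a prefix slice is still meaningful, and embeddings are L2-normalized so dot product = cosine; slice_dim / shortlist are illustrative):

```python
# Coarse pass on a prefix slice of the embedding, then full-dimension re-rank.
import numpy as np

def mrl_search(query, vectors, slice_dim=64, shortlist=100, top=10):
    q_small, v_small = query[:slice_dim], vectors[:, :slice_dim]   # index only the prefix
    coarse = np.argsort(v_small @ q_small)[::-1][:shortlist]       # cheap coarse pass
    fine = np.argsort(vectors[coarse] @ query)[::-1][:top]         # full-dim re-rank
    return coarse[fine]
```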
Binary Quantization
- Map components of each vector to either 0 (negative values) or 1 (positive values)
- Seems pretty extreme, since comparison becomes just an XOR operation
- Use Hamming distance to search: the number of components that differ between the query vector and other vectors
- To search, take the top k vectors with the lowest Hamming distance
- Good for 2-stage retrievals:
- Approximate: use binary quantization to get the top k vectors
- Re-rank: cosine similarity between the query vector and those k vectors (sketch below)
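A numpy sketch of that 2-stage retrieval (assuming L2-normalized vectors, bits packed 8 per byte; shortlist / top are illustrative):

```python
# Stage 1: Hamming-distance shortlist on binary codes; stage 2: cosine re-rank.
import numpy as np

def binarize(vectors):
    # 1 for positive components, 0 otherwise, packed 8 dims per byte
    return np.packbits(vectors > 0, axis=-1)

def hamming(codes, q_code):
    # number of differing bits per row (XOR then popcount)
    return np.unpackbits(np.bitwise_xor(codes, q_code), axis=-1).sum(axis=-1)

def search(query, vectors, codes, shortlist=100, top=10):
    q_code = binarize(query[None])[0]
    candidates = np.argsort(hamming(codes, q_code))[:shortlist]   # stage 1: Hamming
    scores = vectors[candidates] @ query                          # stage 2: cosine re-rank
    return candidates[np.argsort(scores)[::-1][:top]]
```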
Roofline Model
Model to represent the theoretical limit of achievable performance of a computer system, balancing CPU speed against memory bandwidth, characterized by arithmetic intensity (AI):

$$\text{AI} = \frac{\text{FLOPs}}{\text{bytes transferred}}$$
- If AI is high → more compute for less memory bandwidth transferred (compute-bound)
- If AI is low → more memory bandwidth transferred for less compute (memory-bound)
Suppose embedding model is 22.7 megabytes, after using int8 quantization.
The model is loaded into memory, so the time it takes to load the model is:

$$t_{\text{load}} = \frac{\text{model size}}{\text{memory bandwidth}}$$

A MacBook Pro with an M4 Pro has a memory bandwidth of about 273 GB/s, so the time to load the model is:

$$t_{\text{load}} \approx \frac{22.7\ \text{MB}}{273\ \text{GB/s}} \approx 0.08\ \text{ms}$$

So loading the model should only take ~0.08 milliseconds 🤯 if using bare metal or C++.
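Quick sanity check of that back-of-envelope number (273 GB/s is the assumed M4 Pro memory bandwidth from above):

```python
# Back-of-envelope load time: model size / memory bandwidth.
model_bytes = 22.7e6          # 22.7 MB int8 model
bandwidth = 273e9             # ~273 GB/s (assumed M4 Pro memory bandwidth)
print(f"{model_bytes / bandwidth * 1e3:.3f} ms")   # ≈ 0.083 ms
```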
Maybe we hit Amdahl’s Law: binary quantization with 2-stage retrieval actually takes longer than embedding + keyword search with RRF (reciprocal rank fusion).
~11 ms roughly for embedding + querying the table on the DB (no FastAPI)
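For reference, RRF itself is tiny; a sketch (assuming the usual k = 60 constant and ranked lists of doc ids coming from the vector search and the keyword search):

```python
# Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per document.
from collections import defaultdict

def rrf(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. rrf([vector_ids, keyword_ids]) -> fused doc ids, best first
```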