I built TurboQuant+ (https://github.com/TheTom/llama-cpp-turboquant), the llama.cpp implementation of this paper with extensions: asymmetric K/V compression, boundary-layer protection, sparse V dequant, and, as of this week, weight compression (TQ4_1S) that shrinks models 28-42% on disk with minimal quality loss. 5k+ stars, 50+ community testers across Metal, CUDA, and AMD HIP.
Cool to see the same WHT + Lloyd-Max math applied to vector search. The data-oblivious codebook property is exactly what makes it work for online KV cache compression too. No calibration, no training, just quantize and go.
I built a Python implementation of Google's TurboQuant paper (ICLR 2026) for vector search. The key thing that makes this different from PQ and other quantization methods: it's fully data-oblivious. The codebook is derived from math (not trained on your data), so you can add vectors online without ever rebuilding the index. Each vector encodes independently in ~4ms at d=1536.
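To make the "codebook from math, not data" idea concrete, here is a minimal sketch of the rotate-then-quantize step under stated assumptions: unit-norm input, power-of-2 dimension, and a 4-bit Lloyd-Max codebook computed once from the standard normal distribution rather than from your vectors. All names here are illustrative, not the repo's actual API.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform, O(d log d); self-inverse."""
    x = x.astype(np.float64).copy()
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)

def lloyd_max_centroids(bits=4, n=200_000, iters=30, seed=0):
    """Optimal scalar codebook for N(0, 1), derived once from the known
    distribution -- never from user data (the data-oblivious property)."""
    s = np.sort(np.random.default_rng(seed).standard_normal(n))
    c = np.quantile(s, (np.arange(2 ** bits) + 0.5) / 2 ** bits)
    for _ in range(iters):
        idx = np.searchsorted((c[:-1] + c[1:]) / 2, s)  # nearest-centroid cells
        c = np.array([s[idx == k].mean() for k in range(len(c))])
    return c

def encode(v, signs, centroids):
    """Random signs + WHT spread energy, so rotated coords of a unit vector
    look ~N(0, 1/d); rescale by sqrt(d), snap to the nearest fixed centroid."""
    r = fwht(v * signs) * np.sqrt(len(v))
    return np.argmin(np.abs(r[:, None] - centroids[None, :]), axis=1).astype(np.uint8)

def decode(codes, signs, centroids):
    """Invert: codebook lookup, undo the sqrt(d) scale, undo the rotation."""
    d = len(codes)
    return fwht(centroids[codes] / np.sqrt(d)) * signs
```

Since the rotation is orthonormal and the centroids are fixed, any new vector encodes independently with the same shared `signs` seed, which is what makes online insertion rebuild-free.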
The repo reproduces the benchmarks from Section 4.4 of the paper — recall@1@k on GloVe (d=200) and OpenAI embeddings (d=1536, d=3072). At 4-bit on d=1536, you get 0.967 recall@1@1 with 8x compression. At 2-bit, 0.862 recall@1@1 with ~16x compression.
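For anyone wanting to sanity-check recall numbers like these on their own data, the metric is just top-1 agreement between exact and approximate brute-force search. A tiny hypothetical harness (not the repo's code; `base_approx` would be your dequantized vectors):

```python
import numpy as np

def recall_at_1(base, queries, base_approx):
    """Fraction of queries whose top-1 dot-product neighbor in the
    approximate (e.g. dequantized) base matches the exact top-1."""
    exact = np.argmax(queries @ base.T, axis=1)     # true nearest by dot product
    approx = np.argmax(queries @ base_approx.T, axis=1)
    return float((exact == approx).mean())
```

Searching the exact base against itself gives 1.0 by construction; plugging in dequantized vectors gives the recall@1 figures quoted above.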
That is, basically, you just rotate and use the 4-bit centroids: since the distribution is known, you don't need min/max. Notably, once you have that, you can do the dot product with a 256-element lookup table, since the two vectors share the same scale. The important point here is that for this use case it is NOT worth using the 1-bit residual: for the dot product you have a fast path in the vector-x-quant case but not in the quant-x-quant case, and anyway the recall difference is small. On top of that, remember that modern learned embeddings tend to use all the components fairly evenly, so you gain some recall for sure, but not as much as in the KV cache case.
- Slightly improved recall
- Faster index creation
- Online addition of vectors without recalibrating the index
The last point in particular is a big infrastructure win, I think.
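The 256-element lookup table mentioned above can be sketched like this. It works precisely because both vectors were quantized against the same fixed 16-entry codebook (shared scale, no per-vector min/max to fold in); names are illustrative, not any repo's API.

```python
import numpy as np

def make_lut(centroids):
    """LUT[a, b] = centroids[a] * centroids[b]; 16 x 16 = 256 entries."""
    return np.outer(centroids, centroids)

def quant_dot(codes_a, codes_b, lut):
    """Dot product of two 4-bit-coded vectors via table lookups: one gather
    and a sum per coordinate, no multiplies at query time."""
    return float(lut[codes_a, codes_b].sum())
```

This gives exactly the same result as decoding both vectors and multiplying, since each coordinate pair maps to a precomputed centroid product.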
If anyone is running local LLMs and wants to try it: https://github.com/TheTom/turboquant_plus/blob/main/docs/get...
Paper: https://arxiv.org/abs/2504.19874