I built TurboQuant+ (https://github.com/TheTom/llama-cpp-turboquant), the llama.cpp implementation of this paper with extensions: asymmetric K/V compression, boundary-layer protection, sparse V dequant, and, as of this week, weight compression (TQ4_1S) that shrinks models 28-42% on disk with minimal quality loss. 5k+ stars, 50+ community testers across Metal, CUDA, and AMD HIP.
Cool to see the same WHT + Lloyd-Max math applied to vector search. The data-oblivious codebook property is exactly what makes it work for online KV cache compression too. No calibration, no training, just quantize and go.
I built a Python implementation of Google's TurboQuant paper (ICLR 2026) for vector search. The key thing that makes this different from PQ and other quantization methods: it's fully data-oblivious. The codebook is derived from math (not trained on your data), so you can add vectors online without ever rebuilding the index. Each vector encodes independently in ~4ms at d=1536.
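To make the "codebook from math, not data" idea concrete, here is a minimal sketch of the rotate-then-quantize step under stated assumptions: unit-norm input, power-of-2 dimension, and a 4-bit Lloyd-Max codebook computed once from the standard normal distribution rather than from your vectors. All names here are illustrative, not the repo's actual API.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform, O(d log d); self-inverse."""
    x = x.astype(np.float64).copy()
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)

def lloyd_max_centroids(bits=4, n=200_000, iters=30, seed=0):
    """Optimal scalar codebook for N(0, 1), derived once from the known
    distribution -- never from user data (the data-oblivious property)."""
    s = np.sort(np.random.default_rng(seed).standard_normal(n))
    c = np.quantile(s, (np.arange(2 ** bits) + 0.5) / 2 ** bits)
    for _ in range(iters):
        idx = np.searchsorted((c[:-1] + c[1:]) / 2, s)  # nearest-centroid cells
        c = np.array([s[idx == k].mean() for k in range(len(c))])
    return c

def encode(v, signs, centroids):
    """Random signs + WHT spread energy, so rotated coords of a unit vector
    look ~N(0, 1/d); rescale by sqrt(d), snap to the nearest fixed centroid."""
    r = fwht(v * signs) * np.sqrt(len(v))
    return np.argmin(np.abs(r[:, None] - centroids[None, :]), axis=1).astype(np.uint8)

def decode(codes, signs, centroids):
    """Invert: codebook lookup, undo the sqrt(d) scale, undo the rotation."""
    d = len(codes)
    return fwht(centroids[codes] / np.sqrt(d)) * signs
```

Since the rotation is orthonormal and the centroids are fixed, any new vector encodes independently with the same shared `signs` seed, which is what makes online insertion rebuild-free.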
The repo reproduces the benchmarks from Section 4.4 of the paper — recall@1@k on GloVe (d=200) and OpenAI embeddings (d=1536, d=3072). At 4-bit on d=1536, you get 0.967 recall@1@1 with 8x compression. At 2-bit, 0.862 recall@1@1 with ~16x compression.
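For anyone wanting to sanity-check recall numbers like these on their own data, the metric is just top-1 agreement between exact and approximate brute-force search. A tiny hypothetical harness (not the repo's code; `base_approx` would be your dequantized vectors):

```python
import numpy as np

def recall_at_1(base, queries, base_approx):
    """Fraction of queries whose top-1 dot-product neighbor in the
    approximate (e.g. dequantized) base matches the exact top-1."""
    exact = np.argmax(queries @ base.T, axis=1)     # true nearest by dot product
    approx = np.argmax(queries @ base_approx.T, axis=1)
    return float((exact == approx).mean())
```

Searching the exact base against itself gives 1.0 by construction; plugging in dequantized vectors gives the recall@1 figures quoted above.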
That is, basically, you just rotate and use the 4-bit centroids: since the distribution is known, you don't need min/max. Notably, once you have that, you can do the dot product with a 256-element lookup table, since the two vectors share the same scale. The important point here is that for this use case it is NOT worth using the 1-bit residual: for the dot product you have a fast path in the vector-x-quant case but not in the quant-x-quant case, and anyway the recall difference is small. On top of that, remember that modern learned embeddings tend to use all the components fairly evenly, so you gain some recall for sure, but not as much as in the KV cache case.
- Slightly improved recall
- Faster index creation
- Online addition of vectors without recalibrating the index
The last point in particular is a big infrastructure win, I think.
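The 256-element lookup table mentioned above can be sketched like this. It works precisely because both vectors were quantized against the same fixed 16-entry codebook (shared scale, no per-vector min/max to fold in); names are illustrative, not any repo's API.

```python
import numpy as np

def make_lut(centroids):
    """LUT[a, b] = centroids[a] * centroids[b]; 16 x 16 = 256 entries."""
    return np.outer(centroids, centroids)

def quant_dot(codes_a, codes_b, lut):
    """Dot product of two 4-bit-coded vectors via table lookups: one gather
    and a sum per coordinate, no multiplies at query time."""
    return float(lut[codes_a, codes_b].sum())
```

This gives exactly the same result as decoding both vectors and multiplying, since each coordinate pair maps to a precomputed centroid product.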
If anyone is running local LLMs and wants to try it: https://github.com/TheTom/turboquant_plus/blob/main/docs/get...
Paper: https://arxiv.org/abs/2504.19874