ruvector-muvera brings ColBERT-style late-interaction retrieval to Rust: 42× faster than brute-force MaxSim, zero bespoke infrastructure, pure safe Rust.
Modern neural search is dominated by late-interaction retrieval models like ColBERT, ColBERT v2, and PLAID. Instead of compressing a 200-word document into a single vector, these models produce one high-dimensional embedding per token — capturing fine-grained semantics that bi-encoders discard. The result: 3–7% better nDCG@10 on BEIR benchmarks over state-of-the-art bi-encoders like E5-large and text-embedding-3.
The catch? Searching 1 million documents requires scoring 16 query tokens × 200 doc tokens × 1 million docs = 3.2 billion dot products per query. Without approximation, late-interaction retrieval is a compute problem, not a search problem.
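For reference, the exact MaxSim score described above can be sketched in a few lines of Rust. This is a deliberately naive, unoptimized version for illustration, not the crate's implementation:

```rust
/// Exact ColBERT-style MaxSim: for each query token, take its best dot
/// product against every document token, then sum those maxima.
fn max_sim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

fn main() {
    // 16 query tokens x 200 doc tokens = 3,200 dot products per document;
    // over 1M documents that is 3.2 billion dot products per query.
    let query = vec![vec![1.0f32; 4]; 2];
    let doc = vec![vec![0.5f32; 4], vec![1.0f32; 4]];
    assert!((max_sim(&query, &doc) - 8.0).abs() < 1e-6);
}
```

The inner loops make the cost structure obvious: every query token visits every document token of every document, which is exactly the triple product quoted above.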
ruvector-muvera solves this in pure Rust. Using Fixed Dimensional Encodings (FDE) from the MUVERA paper (NeurIPS 2024, arXiv:2405.19504), it reduces multi-vector MaxSim to a standard single-vector inner-product search — enabling HNSW, flat scan, and any other MIPS index to serve ColBERT-style queries with no bespoke infrastructure.
- Fixed Dimensional Encoding (FDE) — R×D random Gaussian projection collapses token sets to a single float vector whose IP ≈ MaxSim
- Three backend variants — BruteForceMaxSim (exact), FlatFdeIndex (fast flat scan), HnswFdeIndex (approximate graph search)
- Trait-based design — swap MIPS backends without changing query code
- Deterministic encoder — reproducible index builds given a seed
- Pure Rust, no unsafe — rand, rand_distr, thiserror only
- 11 unit tests — correctness verified on structured (clustered) and pathological (random) data
- cargo build --release — production binary in < 10 s
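To make the FDE idea concrete, here is a deliberately simplified, dependency-free sketch of the encoding step. This is not the crate's actual `FdeEncoder` (which follows the MUVERA construction and draws Gaussian weights via `rand_distr`); it only illustrates the core mechanic: project every token through a seeded random matrix and accumulate, so the same seed yields comparable document- and query-side encodings.

```rust
// Toy FDE-style encoder: a seeded pseudo-random projection summed over
// the token set. Names and structure are illustrative only.
struct ToyFde {
    proj: Vec<Vec<f32>>, // out_dim rows of token_dim weights
}

impl ToyFde {
    fn new(out_dim: usize, token_dim: usize, seed: u64) -> Self {
        let mut state = seed.max(1); // xorshift state must be nonzero
        let mut next = move || {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            // Uniform in [-1, 1); a stand-in for the Gaussian weights
            // the real encoder would draw.
            (state >> 40) as f32 / (1u64 << 23) as f32 - 1.0
        };
        let proj = (0..out_dim)
            .map(|_| (0..token_dim).map(|_| next()).collect())
            .collect();
        Self { proj }
    }

    /// Collapse a variable-length token set into one fixed-size vector.
    fn encode(&self, tokens: &[Vec<f32>]) -> Vec<f32> {
        let mut out = vec![0.0f32; self.proj.len()];
        for t in tokens {
            for (o, row) in out.iter_mut().zip(&self.proj) {
                *o += row.iter().zip(t).map(|(a, b)| a * b).sum::<f32>();
            }
        }
        out
    }
}
```

Determinism falls out of the seeded generator: two encoders built with the same seed produce bit-identical projections, which is what makes reproducible index builds possible.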
| Benefit | Detail |
|---|---|
| Drop ColBERT into standard search | No PLAID, no custom inverted index |
| 42× query speedup | HnswFDE vs. exact MaxSim at n=10K |
| Protocol-compatible | Arc<FdeEncoder> shared across index variants |
| Memory-tunable | Choose R < tokens_per_doc for memory savings |
| Composable | Works with ruvector-acorn (predicate filtering) |
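The trait-based backend swap can be pictured roughly as follows. This is a hedged sketch: `MultiVecIndex` is the trait name used in the quick start, but the exact method signature and the `ToyFlat` backend here are assumptions for illustration, not the crate's API.

```rust
// Assumed shape of the backend trait; the real crate's signature may differ.
trait MultiVecIndex {
    fn search(&self, query_tokens: &[Vec<f32>], k: usize) -> Vec<(usize, f32)>;
}

// Toy flat backend: one pre-encoded FDE vector per document, ranked by
// plain inner product against a query-side encoding.
struct ToyFlat {
    fdes: Vec<Vec<f32>>,
}

impl MultiVecIndex for ToyFlat {
    fn search(&self, query_tokens: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
        // Collapse the query token set by summation (query-side FDE stand-in).
        let dim = self.fdes[0].len();
        let mut q = vec![0.0f32; dim];
        for t in query_tokens {
            for (qi, ti) in q.iter_mut().zip(t) {
                *qi += ti;
            }
        }
        let mut scored: Vec<(usize, f32)> = self
            .fdes
            .iter()
            .enumerate()
            .map(|(i, d)| (i, d.iter().zip(&q).map(|(a, b)| a * b).sum()))
            .collect();
        scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        scored.truncate(k);
        scored
    }
}

// Query code stays backend-agnostic: any `impl MultiVecIndex` slots in.
fn top_hit(index: &dyn MultiVecIndex, query: &[Vec<f32>]) -> Option<(usize, f32)> {
    index.search(query, 1).into_iter().next()
}
```

Because callers only see the trait, swapping the flat scan for an HNSW-backed implementation changes the recall/QPS trade-off without touching query code.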
| System | Multi-vector support | ColBERT/late-interaction | Rust native | Notes |
|---|---|---|---|---|
| ruvector-muvera | ✅ FDE reduction | ✅ MaxSim approx | ✅ | This crate |
| Qdrant | Partial (binary centroid) | Partial | ❌ (Python/Go API) | No FDE |
| Vespa | Per-token HNSW + late rerank | Partial | ❌ | High build cost |
| Weaviate v1.27 | ColBERT preview | Partial | ❌ | No FDE |
| Milvus 2.5 | Sparse+dense hybrid | ❌ | ❌ | Different paradigm |
| FAISS | Multi-index sharding | ❌ | ❌ | No native FDE |
| PLAID | Yes (custom inverted) | ✅ | ❌ | Bespoke infra required |
Hardware: Intel(R) Xeon(R) Processor @ 2.10GHz
Build profile: release (LTO fat, opt-level=3, codegen-units=1)
Data: synthetic Gaussian vectors (tokens per doc=32, token dim=128, num_reps=64)
| Variant | n_docs | QPS | Speedup vs BruteForce | Recall@10* | Memory |
|---|---|---|---|---|---|
| BruteForceMaxSim | 500 | 1,251 | 1× | 1.000 | 1,000 KB |
| FlatFDE | 500 | 11,950 | 9.5× | 0.109 | 500 KB |
| HnswFDE | 500 | 8,404 | 6.7× | 0.108 | 531 KB |
| BruteForceMaxSim | 2,000 | 117 | 1× | 1.000 | 10,000 KB |
| FlatFDE | 2,000 | 698 | 6.0× | 0.029 | 8,000 KB |
| HnswFDE | 2,000 | 1,580 | 13.5× | 0.022 | 8,125 KB |
| BruteForceMaxSim | 10,000 | 3 | 1× | 1.000 | 160,000 KB |
| FlatFDE | 10,000 | 14 | 4.7× | 0.005 | 320,000 KB |
| HnswFDE | 10,000 | 131 | 42.4× | 0.007 | 320,625 KB |
*Recall measured on pure random Gaussian data — intentionally conservative. MUVERA's FDE approximation requires semantic structure (e.g., ColBERT token embeddings); the NeurIPS 2024 paper reports 37.1 nDCG@10 on MS MARCO Passage (93% of ColBERT v2 quality). See research doc for details.
- R-tuning: smaller R reduces both build time and FDE memory at the cost of recall
- Seed control: deterministic encoder for reproducible index serialization
- ef-tuning: `HnswFdeIndex::with_ef(ef)` trades recall for QPS
- Binary FDE: 1-bit sign encoding → 32× memory reduction, SIMD popcount IP
- IDF-weighted accumulation: stop-word suppression for better recall
- Hierarchical HNSW: O(n·log n) build vs. current O(n²) PoC
- Parallel encoding: Rayon for multi-core FDE construction
- 2D Matryoshka+MUVERA: combine MRL adaptive dimensions with FDE for tiered retrieval
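To give a flavor of the binary-FDE roadmap item, sign quantization and popcount inner products can be sketched like this. This is illustrative only and not yet implemented in the crate; for ±1 vectors, the inner product equals `dim - 2 * hamming(a, b)`:

```rust
/// Pack the sign bits of an FDE vector, 64 dimensions per u64 word.
fn sign_pack(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x >= 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Inner product of the implied +/-1 vectors: dim - 2 * Hamming distance.
/// `count_ones` lowers to a single popcount instruction on most targets.
fn sign_dot(a: &[u64], b: &[u64], dim: usize) -> i32 {
    let hamming: u32 = a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum();
    dim as i32 - 2 * hamming as i32
}
```

One bit per dimension instead of an f32 is where the 32× memory reduction comes from; the XOR+popcount scoring is what makes the SIMD claim plausible.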
```toml
# Cargo.toml
[dependencies]
ruvector-muvera = { git = "https://github.com/ruvnet/ruvector" }
```

```rust
use ruvector_muvera::{FdeEncoder, HnswFdeIndex, MultiVecIndex};
use std::sync::Arc;

// Each doc: Vec of token vectors (e.g. from a ColBERT encoder)
let docs: Vec<Vec<Vec<f32>>> = load_colbert_docs();
let encoder = Arc::new(FdeEncoder::new(64, 128, /*seed*/ 42)?);
let index = HnswFdeIndex::build(docs, encoder.clone())?;

// Query: Vec of query token vectors
let query_tokens: Vec<Vec<f32>> = encode_query("rust vector search");
let results = index.search(&query_tokens, 10)?;
```

- Repository: https://github.com/ruvnet/ruvector
- Research branch: `research/nightly/2026-05-08-muvera`
- Draft PR: ruvnet/RuVector#442
- Research doc: `docs/research/nightly/2026-05-08-muvera/README.md`
- ADR-193: `docs/adr/ADR-193-muvera.md`
- Paper: arXiv:2405.19504 (NeurIPS 2024) — MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings