We reduced the latency on a full-text search query from 220ms → 47ms by looking under the hood of Rust’s “zero-cost” iterators to find that they were silently preventing vectorization. This post serves as a reminder that zero-cost abstractions do not absolve you from practicing mechanical sympathy.
avg latency,
filtered BM25 query,
5 million documents
░░░░ 220 ms
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░ ▓▓▓▓ 47 ms
░░░░ ▓▓▓▓
░░░░ ▓▓▓▓
══════ ══════
before after

Several months ago, one of our customers was struggling with high latency on a
full-text (BM25) query that does a bunch of permissions checks, with up to
thousands of permissions identifiers passed in a ContainsAny filter. Something
like this:
{
  "filters": ["attribute", "ContainsAny", ["a", "c", "f", "j", "m", "o", "t", "x", "z", ...]],
  "rank_by": ["text", "BM25", "some query string"]
}
Filtered BM25 queries such as this are common on turbopuffer, routinely taking less than 50ms over millions of documents. This one seemed innocuous, yet it was taking over 4× that long. Examining the query profiles revealed something interesting: only about 10ms was being spent on the actual BM25 ranking; the rest (>200ms) was being spent evaluating filters.
Filter evaluation of this kind should be cheap. In fact, our query planner reported the size of the relevant filter bitmaps (representing which documents match each filter value) as only 67MB for this query. Based on some napkin math, reading from an NVMe SSD (at a max throughput of 6,240 MB/s) and processing should only take 10-20ms.
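That estimate is straightforward arithmetic; as a sketch (the 6,240 MB/s figure is the drive's rated maximum from above, and the function name here is just illustrative):

```rust
/// Napkin-math estimate: milliseconds to read `megabytes` at `mb_per_sec`.
fn read_time_ms(megabytes: f64, mb_per_sec: f64) -> f64 {
    megabytes / mb_per_sec * 1000.0
}

// read_time_ms(67.0, 6240.0) ≈ 10.7 ms of pure read time,
// leaving only a few more milliseconds for processing.
```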
Typically when there's a large gap between napkin math and reality, it means there's either a big opportunity for optimization or a bug in our understanding. So we started digging...
turbopuffer stores all indexed data in a Log-Structured Merge (LSM) tree. Key-value pairs are stored in sorted files (SSTables) on object storage. A single sorted file would make reads simple (low read amplification), but every write would need to re-sort the entire dataset (high write amplification). Instead, new KV pairs are written to small files to minimize write amplification and asynchronously compacted (merged and deduplicated) into larger files to minimize read amplification. LSM compaction keeps writes fast while limiting how many files a read must touch.
Each file contains metadata describing its minimum and maximum keys. To serve a read, the query engine uses this metadata to find the files that could contain matching keys, then performs a byte-range fetch to get the relevant key ranges from the files.
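A minimal sketch of that pruning step (the types and names are illustrative, not turbopuffer's actual code):

```rust
/// Per-file metadata: the key range an SSTable covers.
struct FileMeta {
    min_key: String,
    max_key: String,
}

/// Return the indices of files whose [min_key, max_key] range
/// overlaps the query range [lo, hi]; only these need to be fetched.
fn candidate_files(files: &[FileMeta], lo: &str, hi: &str) -> Vec<usize> {
    files
        .iter()
        .enumerate()
        // Two ranges overlap iff neither ends before the other begins.
        .filter(|(_, f)| f.min_key.as_str() <= hi && f.max_key.as_str() >= lo)
        .map(|(i, _)| i)
        .collect()
}
```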
query(i OR j) ─▶ read ─▶ merge
▲▲▲
││└──┐
╔═ LSM tree ═════││═══│═══════════╗
║ │└┐ │ ║
║ ┌──┐ ┌──┐ ┌──┐│╔╧═╗│┌──┐ ┌──┐ ║
║ │ab│ │cd│ │fg││║ij║││mn│ │op│ ║
║ └──┘ └──┘ └──┘│╚══╝│└──┘ └──┘ ║
║ │ │ ║
║ │ ┌─┘ ║
║ ┌──┬──┬──┬──┐ │╔═╧╗──┬──┬──┐ ║
║ │ab│cd│ef│gh│ │║ij║kl│mn│op│ ║
║ └──┴──┴──┴──┘ │╚══╝──┴──┴──┘ ║
║ │ ║
║ │ ║
║ ┌──┬──┬──┬──╔═╧╗──┬──┬──┬──┐ ║
║ │ab│cd│ef│gh║ij║kl│mn│op│rs│ ║
║ └──┴──┴──┴──╚══╝──┴──┴──┴──┘ ║
║ ║
╚═════════════════════════════════╝
Our customer's query uses the ContainsAny
filter, essentially an OR to match all values (e.g.
"a" OR "c" OR "f" OR "j" OR "m" OR "o" OR "t" OR "x" OR "z" OR ...).
Each value in the filter triggers a separate key lookup, fanning the read out across many disjoint key ranges.
query(a OR c OR ─▶ read ─▶ merge
f OR j OR ▲▲▲
m OR o) │││
┌─┘│└─────┐
╔═ LSM tree ═════│══│══════│══════╗
║ ┌────┬────┬────┼───┬────┐ ║
║ ╔═╧╗ ╔═╧╗ ╔═╧╗│╔═╧╗ ╔╧═╗│╔╧═╗ ║
║ ║ab║ ║cd║ ║fg║│║ij║ ║mn║│║op║ ║
║ ╚══╝ ╚══╝ ╚══╝│╚══╝ ╚══╝│╚══╝ ║
║ │ │ ║
║ ┌──┬──┬────────┬────┬─┴┐ ║
║ ╔═╧╦═╧╦═╧╗··┐ │╔═╧╗··╔╧═╦╧═╗ ║
║ ║ab║cd║ef║gh╵ │║ij║kl║mn║op║ ║
║ ╚══╩══╩══╝··┘ │╚══╝··╚══╩══╝ ║
║ │ ║
║ ┌──┬──┬─────┼─────┬──┐ ║
║ ╔═╧╦═╧╦═╧╗··╔═╧╗··╔═╧╦═╧╗··┐ ║
║ ║ab║cd║ef║gh║ij║kl║mn║op║rs╵ ║
║ ╚══╩══╩══╝··╚══╝··╚══╩══╝··┘ ║
║ ║
╚═════════════════════════════════╝
Each returned key range produces an iterator over its KV pairs, and these iterators must be merged into a single, sorted, deduplicated stream for filtering and ranking.
The iterators for all key ranges are passed into a merge iterator, which returns keys in order and keeps the most recent KV pair when duplicate keys occur.
The merge for our customer's ContainsAny query, with thousands of filter
values – each of which might produce a separate key range – will be
comparatively more expensive than that for a query fetching only a few
contiguous key ranges.
Still, the merge itself should be cheap. Our merge iterator just compares keys
and moves the winning entry forward – relatively light work. It's built on
Rust's standard Iterator trait, a zero-cost abstraction, so it should compile
down to the same code as a hand-written loop with no additional overhead. In
nearly every other query, it was negligible. But the profile for this particular
customer's query clearly showed the merge consuming over 60% of runtime.
To try to find the hidden cost, let's look at a simplified Rust skeleton of one
of the core parts of our merge iterator code. The scan() function builds the
merge from all candidate key range iterators (here just slices of integers with
a position), then drains it in a loop, with each call to next() recursing down
through the merge structure to produce one value at a time. In production, the
loop body evaluates filters and scores results. Here, we substitute trivial
arithmetic to try to isolate the iterator cost.
fn scan(keys: &[Vec<u64>]) -> u64 {
    let mut merge_tree = Merge::build(keys);
    let mut sum = 0u64;
    // next() recurses down to produce the smallest current key
    while let Some(v) = merge_tree.next() {
        // Simple arithmetic: sum if even
        let x = v + 1;
        if x % 2 == 0 { sum += x; }
    }
    sum
}
If we scan 100,000 integers on a modern 3 GHz CPU, we would expect it to take
~130μs (100K values × 4 instructions each ÷ 3 billion cycles/sec), assuming only
scalar operations. In reality, it takes 6.5ms! 50× more expensive than it should
be if the iterator is truly "zero-cost".
At first glance, nothing in Rust explains the cost. The work is occurring purely
in memory, so we can eliminate I/O from consideration, yet something is clearly
preventing the CPU from ripping through these integers. Every engineer knows
that computers are a tower of abstractions, so we need to look below this one
at what the compiler is actually generating.
A zero-cost abstraction promises that we couldn't hand-write faster assembly for
the same logic. Let's look at the ARM assembly for scan() to see how it
compiles:
LBB10_1: ; loop {
sbfx x8, x1, #0, #1 ;
add x9, x1, #1 ; x = v + 1
and x8, x8, x9 ; x = (x is even) ? x : 0
add x19, x8, x19 ; sum += x
add x0, sp, #8 ;
bl __ZN14bench9ChainTree4next17haad780538028036aE ; Some(v) = tree.next() else { break }
tbnz w0, #0, LBB10_1 ; }
The Rust compiler (LLVM) has produced a loop with just seven instructions: the first four are the loop body, the last three the loop control.
The loop control calls next() and exits when it returns None, otherwise
continuing:
add x0, sp, #8 ; pass &mut self to next()
bl __ZN14bench9ChainTree4next17haad780538028036aE ; tree.next()
tbnz w0, #0, LBB10_1 ; continue if Some
Within the loop body, LLVM has been quite clever with our conditional
arithmetic; instead of evaluating if / branch, it uses a bitmasking trick to
simply add 0 to the sum if the value is odd.
sbfx x8, x1, #0, #1 ; mask: 0xFFFF..FFFF if v is odd (x is even), 0x0 if v is even
add x9, x1, #1 ; x = v + 1
and x8, x8, x9 ; x = x & mask (x is even ? x : 0)
add x19, x8, x19 ; sum += x
This is a standard compiler technique, called predication, that avoids the cost of branch misprediction by turning a conditional into pure arithmetic.
Thanks to LLVM, the work inside the loop body compiles down to just four arithmetic instructions per element; it should take ~130µs based on the napkin math. So why is it taking 50× longer?
The compiler wants to unroll iterations into larger blocks and vectorize with SIMD (Single Instruction, Multiple Data) instructions to process multiple values in parallel.
But it can't do any of that here because of the recursive nature of next().
Each call to next() compares across iterators, mutates internal positions, and
returns one value. The compiler can't predict what a subsequent call will return
until the current one finishes, because each call's output depends on the state
left behind by the previous one. That kills unrolling (the compiler can't batch
calls it can't predict) and vectorization (can't process multiple values if they
arrive one at a time).
Herein lies the hidden opportunity cost of Rust's zero-cost abstraction in our
merge iterator. The iterator itself compiles down to the code you'd write by
hand for a single call. In that sense, it is zero-cost. But the abstraction
also hides the loop shape from the compiler, so even though the loop body is
exactly the kind of dumb, serial work modern CPUs devour, it can't be unrolled
or vectorized across calls. The recursive calls to next() bury 130µs of useful
work inside 6.5ms of merge overhead.
For a query like our customer's ContainsAny, whose many disjoint key ranges
produce many iterators, the merge cost dominates.
To eliminate the opportunity cost, we need to give back to the compiler a tight loop over contiguous data, with no recursive function calls in the middle.
The solution? A classic database technique called batched iterators.
Instead of producing a single element per call, the merge iterator now fills a batch by comparing and interleaving KV pairs from its inputs, then returns the entire batch at once. The consumer processes that batch as a plain array in a tight loop, with no recursive calls in the middle.
Here's what that looks like in an updated Rust skeleton:
fn scan(keys: &[Vec<u64>]) -> u64 {
    let mut merge_tree = Merge::build(keys);
    let mut sum = 0u64;
    // next_batch() recurses down, but produces a batch (N=512) of KV pairs
    while let Some(batch) = merge_tree.next_batch() {
        // process the array
        for val in batch {
            let x = val + 1;
            if x % 2 == 0 { sum += x; }
        }
    }
    sum
}
Batching splits the work into two loops: an outer loop in which next_batch() recurses through the merge structure to refill the batch, and an inner loop that processes the returned batch as a plain array.
The merge cost doesn't disappear, but it is now amortized across 512 KV pairs rather than paid on every one. And since the inner loop is just walking a plain array, the compiler can see the entire loop body and aggressively optimize it by unrolling iterations and vectorizing with SIMD instructions.
The assembly confirms the optimization with the key difference in the inner loop
(LBB11_4). Whereas the original loop compiled to just seven scalar
instructions, the batched version processes 8 values at a time using SIMD
instructions (those ending in .2d and .16b):
LBB11_4: ; loop {
add x9, x21, x8 ;
ldp q4, q5, [x9, #16] ; let vals = batch[i..i+8];
ldp q6, q7, [x9, #48] ;
and.16b v16, v4, v20 ; let mask = vals & 1
cmeq.2d v16, v16, #0 ; mask = (vals & 1 == 0) ? all-ones : all-zeros
add.2d v4, v4, v20 ; let xs = vals + 1
bic.16b v4, v4, v16 ; let xs = xs & mask
add.2d v1, v4, v1 ; sum += xs
add x8, x8, #64 ; i += 8;
cmp x8, #1, lsl #12 ; if i == batch.len() { break }
b.ne LBB11_4 ; }
With this change, our scan benchmark (100,000 values) runs in ~110μs, 60×
faster than before and even beating our 130μs napkin math estimate thanks to
SIMD.
merge execution,
100k values, 64 blocks
░░░░ 6.5 ms
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░
░░░░ ▓▓▓▓ 110 μs
══════ ══════
before after

"Zero-cost" promises that an abstraction compiles away, not that it has no side effects. Rust's iterator compiled each call to code similar to what you'd write by hand, but the abstraction boundary between calls hid the loop shape the compiler needs to apply SIMD and unrolling across calls.
After we updated our production code to use batched iterators, our customer's
query latency dropped from ~220ms to 47ms. Their ContainsAny filter still fans
out across many key ranges, but the cost of the merge is now amortized across
batches with SIMD vectorization over each batch.
This is not a new problem, and the solution isn't new either; batched iterators are a decades-old database technique. But it required that we look below the abstraction and have the mechanical sympathy to see what the CPU actually needed. Modern CPUs will only get faster at doing dumb and serial work such as this. Our latest full-text search and approximate nearest neighbor algorithms embody this. At the end of the day, production profiles trump theory and abstraction. At turbopuffer we continuously pore over them to squeeze more performance out of the database.
turbopuffer is a fast search engine that hosts 2.5T+ documents, handles 10M+ writes/s, and serves 10k+ queries/s. We are ready for far more. We hope you'll trust us with your queries.