FTS v2, the newest version of turbopuffer's homegrown text search engine, is up to 20x faster than v1 thanks to (a) an improved storage layout and (b) a better search algorithm. This post is about the better search algorithm.
turbopuffer is often queried by agents, who tend to craft longer queries than humans. A key characteristic of the FTS v2 search algorithm is that it performs very well on such long queries (tens of terms). In particular, it is up to several times faster than block-max WAND, the most popular algorithm for lexical search.
Below are some representative benchmarks (full results) of FTS v1 versus FTS v2 on a 5M-document Wikipedia export sample dataset borrowed from Quickwit's Search Benchmark Game. The results support the claim that FTS v2 performs extremely well not only on simple queries, but also on harder queries with many terms or queries consisting only of stopwords.
turbopuffer FTS latency (ms), k=100, English Wikipedia export, ~5M documents:

| Query | FTS v1 | FTS v2 |
|---|---|---|
| q1: san francisco | 8ms | 3ms |
| q2: the who | 57ms | 7ms |
| q3: united states constitution | 20ms | 5ms |
| q4: lord of the rings | 75ms | 6ms |
| q5: pop singer songwriter born 1989 won best country song time person of the year | 174ms | 20ms |
Consider the query "new york" with a top-k of 10. It has two terms, "new" and "york", contributing max scores of 2.0 and 5.0 respectively. "york" has the higher score because the word is less common than "new". A document's query score is the sum of its per-term scores, and a higher score is better.
We can deduce that any document may get scores up to the following values depending on what terms it contains:
| Document | Max possible score |
|---|---|
| only "new" | 2.0 |
| only "york" | 5.0 |
| "new" and "york" | 7.0 |
During evaluation, we may find ourselves in a scenario where the 10th best-scoring document so far has a score of 4.0. As k=10, any document containing only "new" can never enter the results! At this point, we can skip all documents that only contain "new". This simple optimization can lead to dramatic speedups.
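The pruning check above can be sketched in a few lines of Python. This is a minimal illustration, not turbopuffer's implementation; the names (`MAX_SCORE`, `can_qualify`) and the heap contents are hypothetical, chosen to match the "new york" example.

```python
import heapq

# Per-term max score contributions from the "new york" example above.
MAX_SCORE = {"new": 2.0, "york": 5.0}

def can_qualify(terms_in_doc, topk_heap, k):
    """Return True if a document matching these terms could still
    enter the top-k heap, given each term's max possible score."""
    upper_bound = sum(MAX_SCORE[t] for t in terms_in_doc)
    if len(topk_heap) < k:
        return True                    # heap not full yet: everything qualifies
    min_in_heap = topk_heap[0][0]      # heapq keeps the smallest score at index 0
    return upper_bound > min_in_heap

# Hypothetical full top-10 heap whose minimum score is 4.0.
heap = [(4.0, 42)] + [(s, i) for i, s in
        enumerate([4.5, 5.0, 5.5, 6.0, 6.1, 6.2, 6.3, 6.5, 7.0])]
heapq.heapify(heap)

print(can_qualify(["new"], heap, k=10))          # False: skip the document
print(can_qualify(["new", "york"], heap, k=10))  # True: must be evaluated
```

A document containing only "new" has an upper bound of 2.0, which cannot beat the heap minimum of 4.0, so it is skipped without being scored.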
There are two main algorithms that take advantage of this idea to speed up query evaluation without trading recall: MAXSCORE and WAND.
Let's look at how they evaluate a query on terms "new", "york", and "population" which respectively contribute up to 2.0, 5.0, and 4.0 to the score (top_k=3 for brevity).
The first family of search algorithms is MAXSCORE. It starts by sorting the query terms by their score contribution: [new(2.0), population(4.0), york(5.0)].
Then, it starts iterating through the documents that match each term, keeping track of the highest scoring documents in the top-k heap.
Let’s add a proverbial breakpoint a few iterations in:
----------------------------------------------------------
| Term | Max score | Postings (matching doc IDs) |
|--------------|-----------|-----------------------------|
| new | 2.0 | 0, 1, 3, 5, 6, 7, 9, 10 |
| | | ^ |
| population | 4.0 | 1, 2, 3, 8, 9 |
| | | ^ |
| york | 5.0 | 2, 3, 7, 10 |
| | | ^ |
----------------------------------------------------------
# (score, doc_id) -- bigger score is better
TOPK3_HEAP = [(6.0, 1), (9.0, 2), (11.0, 3)]
At this point, the minimum score in the heap is 6.0, so documents with “new” alone can no longer qualify, and “new” has become a non-essential term. We can simply ignore it when looking for the next document to evaluate, speeding up query performance.
In a few iterations, the minimum score will be 7.0, at which point “population” also becomes non-essential, speeding up the remainder of the query execution even further!
MAXSCORE: use essential terms to find candidates, use all terms to compute scores
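The essential/non-essential split can be sketched as follows. This is an illustrative sketch, not turbopuffer's code: it assumes terms are pre-sorted ascending by max score and that a candidate must *strictly* exceed the heap minimum to qualify (consistent with the walkthrough above, where "population" only becomes non-essential once the minimum reaches 7.0).

```python
def partition_terms(terms_sorted, min_heap_score):
    """Split terms (sorted ascending by max score) into non-essential
    and essential.  A prefix of terms is non-essential when the sum of
    their max scores cannot beat the current heap minimum: a document
    matching only those terms can never qualify for the top-k."""
    prefix_sum = 0.0
    split = 0
    for i, (_term, max_score) in enumerate(terms_sorted):
        prefix_sum += max_score
        if prefix_sum < min_heap_score:
            split = i + 1
    return terms_sorted[:split], terms_sorted[split:]

terms = [("new", 2.0), ("population", 4.0), ("york", 5.0)]

# With a heap minimum of 6.0, only "new" (2.0) is non-essential;
# with 7.0, "new" + "population" (2.0 + 4.0 = 6.0) cannot qualify either.
print(partition_terms(terms, 6.0))
print(partition_terms(terms, 7.0))
```

Only the essential terms drive candidate generation; once a candidate is found, all terms (including non-essential ones) are consulted to compute its exact score.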
The second family of skipping algorithms is WAND ("weak and"). We consider it document-centric, because instead of continuously determining which terms no longer qualify, it attempts to find the next doc ID that could potentially qualify for the top-k heap.
Let's breakpoint at the same moment as with MAXSCORE:
----------------------------------------------------------
| Term | Max score | Postings (matching doc IDs) |
|--------------|-----------|-----------------------------|
| new | 2.0 | 0, 1, 3, 5, 6, 7, 9, 10 |
| | | ^ |
| population | 4.0 | 1, 2, 3, 8, 9 |
| | | ^ |
| york | 5.0 | 2, 3, 7, 10 |
| | | ^ |
----------------------------------------------------------
# (score, doc_id) -- bigger score is better
TOPK3_HEAP = [(6.0, 1), (9.0, 2), (11.0, 3)]
WAND sorts doc IDs in ascending order and calculates the score upper bound for each ID. A term contributes to a document's upper bound if its iterator has passed the doc ID and confirmed it is present, or hasn't yet reached the doc ID (in which case WAND must assume it could be in the posting list).
For example:
# (doc ID, score upper bound)
[
(5, 2.0), # matches "new", will not match "york" or "population"
(7, 7.0), # matches "york", might match "new"
(8, 11.0) # matches "population", might match "new" and "york"
]
Then, WAND finds the first document whose best possible score exceeds the minimum heap score (6.0). Here, that's doc ID 7, so WAND evaluates it next. If the minimum heap score were greater than 7.0, WAND would skip directly to 8.
WAND: use all terms, continuously skipping doc IDs that cannot possibly qualify.
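The pivot-finding step can be sketched like this. Again a minimal illustration with hypothetical names, assuming each term iterator exposes its current doc ID and max score:

```python
def find_pivot(iterators, min_heap_score):
    """WAND pivot selection (sketch).  `iterators` is a list of
    (current_doc_id, max_score) pairs, one per query term.  Sort by
    current doc ID, then accumulate max scores until the upper bound
    exceeds the heap minimum; that doc ID is the first candidate
    worth evaluating -- everything before it can be skipped."""
    upper_bound = 0.0
    for doc_id, max_score in sorted(iterators):
        upper_bound += max_score
        if upper_bound > min_heap_score:
            return doc_id
    return None  # no remaining document can qualify

# State from the walkthrough above: "new" at doc 5 (2.0),
# "york" at doc 7 (5.0), "population" at doc 8 (4.0).
iterators = [(5, 2.0), (7, 5.0), (8, 4.0)]

print(find_pivot(iterators, 6.0))  # 7: first doc whose bound (2.0 + 5.0) beats 6.0
print(find_pivot(iterators, 7.0))  # 8: needs "population" too to beat 7.0
```

This is the per-candidate bookkeeping (sorting iterators, accumulating bounds) that gives WAND its stronger skipping but lower evaluation throughput, as discussed next.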
In their basic forms, these algorithms use global max scores: each term has a single max score (like 2.0 for "new") that applies to its entire posting list containing all doc IDs. With a global max score, we must evaluate each document individually to determine if it can qualify for the heap. But what if we could skip entire groups of documents at once?
This is what the block-max variants of MAXSCORE and WAND achieve. Block-max divides each posting list into fixed-size blocks containing a subset of doc IDs, storing a local max score for each block (the maximum score any document in that block can get from that term). Documents within a block can have different scores because term frequency (TF) varies per document, even though inverse document frequency (IDF) is constant for the term. If a block's local max score doesn't exceed the minimum heap score, the algorithm can skip the entire block of documents. Block-max MAXSCORE and block-max WAND are currently considered the state of the art for lexical search.
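For a single-term view, block-level skipping can be sketched as follows. The block contents and local max scores here are made up for illustration; real implementations use fixed-size compressed blocks and combine the per-term local bounds across all query terms.

```python
# Hypothetical block-max posting list: blocks of doc IDs, each carrying
# a local max score (TF varies per document, so the local bound is
# usually much tighter than the term's global max score).
blocks = [
    {"docs": [0, 1, 3, 5],     "local_max": 1.2},
    {"docs": [6, 7, 9, 10],    "local_max": 2.0},
    {"docs": [12, 15, 18, 20], "local_max": 0.8},
]

def candidate_docs(blocks, min_heap_score):
    """Yield doc IDs only from blocks whose local max score could
    still beat the heap minimum; other blocks are skipped whole,
    without ever decoding their doc IDs."""
    for block in blocks:
        if block["local_max"] > min_heap_score:
            yield from block["docs"]

# With a heap minimum of 1.0, the third block (local max 0.8) is
# skipped entirely; with 5.0, all three blocks are skipped.
print(list(candidate_docs(blocks, 1.0)))
```

Skipping a whole block at once is what lets the block-max variants avoid the per-document upper-bound check that the global-max versions must perform.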
Block-max MAXSCORE/WAND have essentially no disadvantages compared to plain MAXSCORE/WAND. For brevity, I'll write "MAXSCORE" and "WAND" for the remainder of this post, but know that I'm referring to their block-max counterparts.
WAND is currently the dominant algorithm for query evaluation, as it takes more information into account (the current positions of the postings list iterators) and can thus skip more documents than MAXSCORE.
However, skipping more documents doesn't necessarily mean faster. In order to skip more documents, WAND must spend more cycles computing the next doc ID that can qualify for the top-k heap. This introduces more work per evaluated document, giving WAND a lower throughput than MAXSCORE. When WAND's skipping power is outweighed by low throughput, MAXSCORE becomes more performant.
The Apache Lucene project initially started with WAND. While it worked well for most queries, users reported certain query types where it was slower than even exhaustive evaluation due to poor skipping decisions compounded by poor throughput. This led Lucene to switch to MAXSCORE and optimize it for throughput by improving memory locality, reducing unpredictable branches, and taking advantage of SIMD instructions.
| Algorithm | Skipping | Evaluation throughput |
|---|---|---|
| Exhaustive | None | Very good |
| WAND | Very good | Average |
| MAXSCORE | Good | Good |
| Lucene's vectorized MAXSCORE | Good | Very good |
Lucene's vectorized MAXSCORE throughput is so good that it's also extremely competitive with WAND on simple queries where WAND is expected to shine, as confirmed by the Tantivy vs. Lucene benchmarks (Tantivy implements WAND). And because throughput is very good, it doesn't become slower than exhaustive search in cases when skipping is hard.
This good throughput is especially important on long queries (tens of terms) where skipping is harder, especially for WAND-based algorithms whose overhead scales with the number of terms. Our evaluations confirmed those made by Apache Lucene; this algorithm often performs several times faster than WAND on long queries.
Side note: the WAND vs. MAXSCORE trade-off has similarities with HNSW vs. SPFresh, the system turbopuffer uses for ANN search. WAND and HNSW optimize for skipping first, and throughput second, whereas Lucene's MAXSCORE and SPFresh optimize for throughput first, and skipping second.
FTS v2 builds on the same skipping logic as MAXSCORE, but borrows from Lucene's vectorized implementation, differing significantly from the textbook description in order to improve throughput.
The main novelty is that postings iterators are no longer advanced in an alternating fashion. In traditional MAXSCORE implementations, iterators advance alternately: if iterator A is at doc ID 5 and iterator B is at doc ID 3, the algorithm advances B to catch up, perhaps advancing B beyond A, causing A to then catch up, and so on.
Our implementation works differently. After computing the next range of doc IDs to evaluate based on local max scores, it processes all matching doc IDs and scores needed from each postings iterator in one batch before moving to the next iterator. This means the same iterator advances many times in a row (often tens of times or more) before switching to another iterator.
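The batched approach can be sketched as below. This is a simplified illustration of the access pattern, not turbopuffer's actual code: the window size, the postings, and the function name are hypothetical, and a real implementation would operate on compressed blocks rather than Python lists.

```python
def score_window(postings_by_term, window_start, window_size):
    """Batched scoring (sketch): for a window of doc IDs, drain each
    postings iterator fully before touching the next one, accumulating
    partial scores into a flat array indexed by offset in the window.
    Each iterator advances many times in a row -- a serial, predictable
    access pattern -- instead of alternating between iterators."""
    scores = [0.0] * window_size
    for postings in postings_by_term:   # one [(doc_id, score), ...] list per term
        for doc_id, score in postings:  # tight inner loop over a single iterator
            offset = doc_id - window_start
            if 0 <= offset < window_size:
                scores[offset] += score
    return scores

# Hypothetical per-term postings for the doc ID window [0, 8).
postings_by_term = [
    [(0, 1.0), (1, 1.5), (3, 1.2), (5, 0.9)],   # "new"
    [(1, 3.1), (2, 4.0), (3, 2.7)],             # "population"
    [(2, 5.0), (3, 4.4), (7, 3.9)],             # "york"
]
print(score_window(postings_by_term, 0, 8))
```

The inner loop touches one posting list at a time, sequentially, and writes into a small dense score buffer, which is the kind of serial, branch-light pattern modern CPUs (and SIMD units) execute fastest.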
This batching approach enables the same CPU-level throughput optimizations that make Lucene's vectorized MAXSCORE fast: better memory locality, fewer unpredictable branches, and effective use of SIMD instructions.
Designing fast search engines is a subtle optimization problem. On the one hand, CPUs reach peak efficiency on serial workloads with predictable branches. On the other hand, the smartest algorithms often have a random access pattern and lots of conditionals. So it takes some effort to select the right algorithm and then tune it to get the best out of the CPU.
Plus, context keeps changing. AVX-512 has now been widely available on server-side CPUs for several years, further moving the cursor in the direction of serial, branchless workloads. "Serial and dumb" can often beat "smart and random" on modern CPUs. Furthermore, agents write longer queries than humans, so it's becoming increasingly important for text search to scale well with the number of terms.
These factors force us to periodically revisit our choices of algorithms and their implementations. For text search, this means that the cursor has shifted more and more from WAND to MAXSCORE, which scales better with the number of terms and can be tuned to be more CPU-friendly.
Thanks for reading this blog post. I hope you enjoyed the deep dive, as well as the performance you're now getting on text search with turbopuffer thanks to FTS v2.
Below you'll find results of turbopuffer's nightly benchmarks comparing FTS v1 vs. FTS v2. They reuse infrastructure from Search Benchmark, the Game: a 5M-document export of English Wikipedia, plus a mix of AOL queries and handcrafted queries that exhibit characteristics similar to those we see in production.
| Query | top_k | FTS v1 latency (ms) | FTS v2 latency (ms) | Speedup (Multiplier) |
|---|---|---|---|---|
| the | 10 | 26.0 | 3.9 | 6.7x |
| the | 100 | 25.9 | 5.2 | 5.0x |
| the | 1000 | 28.8 | 12.8 | 2.2x |
| search | 10 | 3.1 | 1.4 | 2.2x |
| search | 100 | 3.5 | 2.0 | 1.8x |
| search | 1000 | 5.9 | 4.6 | 1.3x |
| new york | 10 | 17.5 | 4.0 | 4.4x |
| new york | 100 | 17.7 | 6.0 | 2.9x |
| new york | 1000 | 20.2 | 9.7 | 2.1x |
| san francisco | 10 | 7.1 | 2.2 | 3.2x |
| san francisco | 100 | 7.5 | 2.8 | 2.7x |
| san francisco | 1000 | 9.9 | 5.7 | 1.7x |
| the who | 10 | 57.4 | 4.8 | 12.0x |
| the who | 100 | 57.3 | 6.7 | 8.6x |
| the who | 1000 | 60.1 | 16.4 | 3.7x |
| the incredibles | 10 | 37.4 | 3.3 | 11.3x |
| the incredibles | 100 | 37.4 | 5.2 | 7.2x |
| the incredibles | 1000 | 40.9 | 13.7 | 3.0x |
| new york population | 10 | 31.0 | 6.5 | 4.8x |
| new york population | 100 | 31.3 | 8.3 | 3.8x |
| new york population | 1000 | 33.8 | 14.0 | 2.4x |
| world bank president | 10 | 20.9 | 4.6 | 4.5x |
| world bank president | 100 | 21.1 | 6.2 | 3.4x |
| world bank president | 1000 | 23.9 | 10.4 | 2.3x |
| united states constitution | 10 | 19.8 | 3.8 | 5.2x |
| united states constitution | 100 | 19.8 | 5.1 | 3.9x |
| united states constitution | 1000 | 23.0 | 9.4 | 2.4x |
| law school rankings | 10 | 14.9 | 2.7 | 5.5x |
| law school rankings | 100 | 15.3 | 4.3 | 3.6x |
| law school rankings | 1000 | 18.0 | 9.9 | 1.8x |
| lord of the rings | 10 | 74.5 | 4.2 | 17.7x |
| lord of the rings | 100 | 75.4 | 6.3 | 12.0x |
| lord of the rings | 1000 | 79.8 | 15.6 | 5.1x |
| story of a girl | 10 | 84.2 | 5.0 | 16.8x |
| story of a girl | 100 | 84.3 | 7.1 | 11.9x |
| story of a girl | 1000 | 86.5 | 16.4 | 5.3x |
| to be or not to be | 10 | 99.7 | 18.4 | 5.4x |
| to be or not to be | 100 | 100.6 | 22.3 | 4.5x |
| to be or not to be | 1000 | 103.8 | 31.1 | 3.3x |
| arsenal player 1998 forward goal non flying dutch | 10 | 32.7 | 7.2 | 4.5x |
| arsenal player 1998 forward goal non flying dutch | 100 | 33.1 | 8.6 | 3.8x |
| arsenal player 1998 forward goal non flying dutch | 1000 | 36.2 | 13.6 | 2.7x |
| france president world war resistance london appeal | 10 | 53.7 | 10.6 | 5.1x |
| france president world war resistance london appeal | 100 | 54.4 | 12.0 | 4.5x |
| france president world war resistance london appeal | 1000 | 57.6 | 18.9 | 3.0x |
| pop singer songwriter born 1989 won best country song time person of year | 10 | 172.9 | 15.9 | 10.9x |
| pop singer songwriter born 1989 won best country song time person of year | 100 | 174.4 | 20.1 | 8.7x |
| pop singer songwriter born 1989 won best country song time person of year | 1000 | 178.0 | 32.4 | 5.5x |
| kenyan world marathon record olympic champion bbc world sport star of the year | 10 | 143.5 | 12.0 | 12.0x |
| kenyan world marathon record olympic champion bbc world sport star of the year | 100 | 143.2 | 17.2 | 8.3x |
| kenyan world marathon record olympic champion bbc world sport star of the year | 1000 | 147.3 | 30.4 | 4.8x |
| a search engine is an information retrieval software system designed to help find information stored on one or more computer systems | 10 | 350.2 | 18.3 | 19.1x |
| a search engine is an information retrieval software system designed to help find information stored on one or more computer systems | 100 | 350.0 | 26.7 | 13.1x |
| a search engine is an information retrieval software system designed to help find information stored on one or more computer systems | 1000 | 357.7 | 47.7 | 7.5x |
| a database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed | 10 | 570.9 | 30.5 | 18.7x |
| a database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed | 100 | 575.1 | 43.3 | 13.3x |
| a database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed | 1000 | 577.7 | 80.7 | 7.2x |
We host 1T+ documents, handle writes at 10M+ writes/s, and serve 10k+ queries/s. We are ready for far more. We hope you'll trust us with your queries.
Get started