turbopuffer supports BM25 full-text search for `string` and `[]string` types. This guide shows how to configure and use full-text search with different options.
turbopuffer's full-text search engine was written from the ground up for the turbopuffer storage engine, enabling low-latency searches directly on object storage.
For hybrid search combining both vector and BM25 results, see Hybrid Search.
For all available full-text search options, see the Schema documentation.
The simplest form of full-text search is on a single field of type `string`.
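As an illustrative sketch only, the snippet below shows the shape of a schema that enables full-text search on a single `string` attribute and a query that ranks by `BM25`. The attribute name `content`, the query text, and the exact payload layout are assumptions; consult the API reference for the precise syntax.

```python
# Illustrative sketch: single-attribute BM25 full-text search.
# The attribute name "content" and the query text are hypothetical.

# Schema: enable full-text search on one string attribute with defaults.
schema = {
    "content": {
        "type": "string",
        "full_text_search": True,  # default tokenizer, language, k1, b
    },
}

# Query: rank documents by BM25 relevance of "content" against the terms.
query = {
    "rank_by": ["content", "BM25", "walrus arctic diet"],
    "top_k": 10,
}
```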
You can use full-text search operators like `Sum` and `Product` to perform a full-text search across multiple attributes simultaneously.
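For example, the sketch below combines BM25 scores from two attributes with `Sum` and weights one of them with `Product`. The attribute names and the exact nesting of the `rank_by` clause are assumptions, so treat this as an illustration rather than a definitive query.

```python
# Illustrative sketch: multi-attribute BM25 with Sum and Product.
# "title" and "content" are hypothetical attribute names.
query = {
    "rank_by": [
        "Sum",
        [
            # Weight matches in the title twice as heavily as the body.
            ["Product", [2.0, ["title", "BM25", "walrus diet"]]],
            ["content", "BM25", "walrus diet"],
        ],
    ],
    "top_k": 10,
}
```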
When turbopuffer's built-in tokenizers aren't sufficient, use the `pre_tokenized_array` tokenizer to perform client-side tokenization using arbitrary logic.
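A minimal sketch of what that might look like: the schema marks a `[]string` attribute with the `pre_tokenized_array` tokenizer, and your own code produces the tokens before writing documents. The attribute name `content_tokens` and the trivial whitespace tokenizer are hypothetical stand-ins for your own logic.

```python
# Illustrative sketch: bring-your-own tokenization with pre_tokenized_array.
# "content_tokens" and the whitespace tokenizer are hypothetical.

schema = {
    "content_tokens": {
        "type": "[]string",
        "full_text_search": {"tokenizer": "pre_tokenized_array"},
    },
}

def tokenize(text: str) -> list[str]:
    # Arbitrary client-side logic; here, a trivial whitespace split.
    return text.lower().split()

rows = [
    {"id": 1, "content_tokens": tokenize("walruses eat clams and mussels")},
    {"id": 2, "content_tokens": tokenize("arctic foxes hunt lemmings")},
]
```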
turbopuffer currently supports language-aware stemming and stopword removal for full-text search. The following languages are supported:
english (default), arabic, danish, dutch, finnish, french, german, greek, hungarian, italian, norwegian, portuguese, romanian, russian, swedish, tamil, and turkish.
Other languages can be supported by contacting us.
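Language, stemming, and stopword removal are configured per attribute. The sketch below shows one plausible shape for that configuration, using the option names mentioned in this guide; the attribute name is hypothetical and the exact schema layout may differ from the API reference.

```python
# Illustrative sketch: language-aware full-text search configuration.
# "content" is a hypothetical attribute name.
schema = {
    "content": {
        "type": "string",
        "full_text_search": {
            "language": "german",      # one of the supported languages above
            "stemming": True,          # reduce terms to their stems
            "remove_stopwords": True,  # drop very common words
            "case_sensitive": False,   # match regardless of case
        },
    },
}
```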
The following tokenizers are available: `word_v2`, `word_v1` (default), `word_v0`, and `pre_tokenized_array`.
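The tokenizer is chosen per attribute in the schema; as a brief, hedged sketch (attribute name hypothetical):

```python
# Illustrative sketch: select a specific tokenizer for an attribute.
schema = {
    "content": {
        "type": "string",
        "full_text_search": {"tokenizer": "word_v2"},
    },
}
```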
The `word_v2` tokenizer forms tokens from ideographic codepoints, contiguous sequences of alphanumeric codepoints, and sequences of emoji codepoints that form a single glyph. Codepoints that are not alphanumeric, ideographic, or emoji are discarded. Codepoints are classified according to Unicode v16.0.
The `word_v1` tokenizer works like the `word_v2` tokenizer, except that ideographic codepoints are treated as alphanumeric codepoints. Codepoints are classified according to Unicode v10.0.
The `word_v0` tokenizer works like the `word_v1` tokenizer, except that emoji codepoints are discarded.
The `pre_tokenized_array` tokenizer is a special tokenizer that indicates that you want to perform your own tokenization. This tokenizer can only be used on attributes of type `[]string`; each string in the array is interpreted as a token. When this tokenizer is active, queries using the `BM25` or `ContainsAllTokens` operators must supply a query operand of type `[]string` rather than `string`; each string in the array is interpreted as a token. Tokens are always matched case-sensitively, without stemming or stopword removal. You cannot specify `language`, `stemming: true`, `remove_stopwords: true`, or `case_sensitive: false` when using this tokenizer.
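Because the query operand must itself be a list of tokens, a query against such an attribute might look like the sketch below. It reuses the hypothetical `content_tokens` attribute from earlier; the payload layout and the `filters` key are assumptions to be checked against the API reference.

```python
# Illustrative sketch: querying a pre_tokenized_array attribute.
# BM25 and ContainsAllTokens operands are arrays of tokens, not strings.
query = {
    "rank_by": ["content_tokens", "BM25", ["walruses", "clams"]],
    "filters": ["content_tokens", "ContainsAllTokens", ["clams"]],
    "top_k": 10,
}
```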
Other tokenizers can be supported by contacting us.
The BM25 scoring algorithm involves two parameters that can be tuned for your workload:
`k1` controls how quickly the impact of term frequency saturates. When `k1` is close to zero, term frequency is effectively ignored when scoring a document. When `k1` is close to infinity, term frequency contributes nearly linearly to the score.

The default value, `1.2`, means that increasing term frequency in a document boosts the score heavily at first but quickly yields diminishing returns.
`b` controls document length normalization. When `b` is `0.0`, documents are treated equally regardless of length, which lets long documents dominate through sheer volume of terms. When `b` is `1.0`, documents are boosted or penalized based on the ratio of their length to the average document length in the corpus.

The default value, `0.75`, controls for length bias without eliminating it entirely (long documents are often legitimately more relevant).
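To make the roles of `k1` and `b` concrete, here is the textbook Okapi BM25 per-term contribution written out in Python. This is the standard formula, not necessarily turbopuffer's exact implementation, and is included only to show how `k1` saturates term frequency and how `b` scales length normalization.

```python
def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    idf: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook Okapi BM25 contribution of one query term to one document.

    tf          -- occurrences of the term in the document
    doc_len     -- number of tokens in the document
    avg_doc_len -- average document length across the corpus
    idf         -- inverse document frequency of the term
    """
    # b interpolates between no length normalization (b = 0) and full
    # normalization by doc_len / avg_doc_len (b = 1).
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # As tf grows, the fraction approaches k1 + 1, so repeated occurrences
    # of a term saturate; larger k1 delays that saturation.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)
```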
The default values are suitable for most applications. Tuning is typically required only if your corpus consists of extremely short texts like tweets (decrease `k1` and `b`) or extremely long texts like legal documents (increase `k1` and `b`).
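As a hedged sketch of how such tuning might be expressed in the schema (attribute name and values are illustrative, not recommendations):

```python
# Illustrative sketch: tune BM25 parameters for a corpus of short texts.
schema = {
    "content": {
        "type": "string",
        "full_text_search": {
            "k1": 0.9,  # below the 1.2 default: saturate term frequency sooner
            "b": 0.4,   # below the 0.75 default: weaker length normalization
        },
    },
}
```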
To tune these parameters, we recommend an empirical approach: build a set of evals, and choose the parameter values that maximize performance on those evals.