turbopuffer supports BM25 full-text search for `string` and `[]string` types. This guide shows how to configure and use full-text search with different options.
turbopuffer's full-text search engine was written from the ground up for the turbopuffer storage engine, enabling low-latency searches directly on object storage.
For hybrid search combining both vector and BM25 results, see Hybrid Search.
For all available full-text search options, see the Schema documentation.
The simplest form of full-text search is on a single field of type `string`.
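As an illustrative sketch only, the snippet below shows the shape of a schema that enables full-text search on a single `string` attribute and a query that ranks by `BM25`. The attribute name `content`, the query text, and the exact payload layout are assumptions; consult the API reference for the precise syntax.

```python
# Illustrative sketch: single-attribute BM25 full-text search.
# The attribute name "content" and the query text are hypothetical.

# Schema: enable full-text search on one string attribute with defaults.
schema = {
    "content": {
        "type": "string",
        "full_text_search": True,  # default tokenizer, language, k1, b
    },
}

# Query: rank documents by BM25 relevance of "content" against the terms.
query = {
    "rank_by": ["content", "BM25", "walrus arctic diet"],
    "top_k": 10,
}
```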
You can use full-text search operators like `Sum` and `Product` to perform a full-text search across multiple attributes simultaneously.
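For example, the sketch below combines BM25 scores from two attributes with `Sum` and weights one of them with `Product`. The attribute names and the exact nesting of the `rank_by` clause are assumptions, so treat this as an illustration rather than a definitive query.

```python
# Illustrative sketch: multi-attribute BM25 with Sum and Product.
# "title" and "content" are hypothetical attribute names.
query = {
    "rank_by": [
        "Sum",
        [
            # Weight matches in the title twice as heavily as the body.
            ["Product", [2.0, ["title", "BM25", "walrus diet"]]],
            ["content", "BM25", "walrus diet"],
        ],
    ],
    "top_k": 10,
}
```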
When turbopuffer's built-in tokenizers aren't sufficient, use the `pre_tokenized_array` tokenizer to perform client-side tokenization using arbitrary logic.
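A minimal sketch of what that might look like: the schema marks a `[]string` attribute with the `pre_tokenized_array` tokenizer, and your own code produces the tokens before writing documents. The attribute name `content_tokens` and the trivial whitespace tokenizer are hypothetical stand-ins for your own logic.

```python
# Illustrative sketch: bring-your-own tokenization with pre_tokenized_array.
# "content_tokens" and the whitespace tokenizer are hypothetical.

schema = {
    "content_tokens": {
        "type": "[]string",
        "full_text_search": {"tokenizer": "pre_tokenized_array"},
    },
}

def tokenize(text: str) -> list[str]:
    # Arbitrary client-side logic; here, a trivial whitespace split.
    return text.lower().split()

rows = [
    {"id": 1, "content_tokens": tokenize("walruses eat clams and mussels")},
    {"id": 2, "content_tokens": tokenize("arctic foxes hunt lemmings")},
]
```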
turbopuffer currently supports language-aware stemming and stopword removal for full-text search. The following languages are supported:
english (default), arabic, danish, dutch, finnish, french, german, greek, hungarian, italian, norwegian, portuguese, romanian, russian, swedish, tamil, and turkish.
Other languages can be supported by contacting us.
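Language, stemming, and stopword removal are configured per attribute. The sketch below shows one plausible shape for that configuration, using the option names mentioned in this guide; the attribute name is hypothetical and the exact schema layout may differ from the API reference.

```python
# Illustrative sketch: language-aware full-text search configuration.
# "content" is a hypothetical attribute name.
schema = {
    "content": {
        "type": "string",
        "full_text_search": {
            "language": "german",      # one of the supported languages above
            "stemming": True,          # reduce terms to their stems
            "remove_stopwords": True,  # drop very common words
            "case_sensitive": False,   # match regardless of case
        },
    },
}
```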
The following tokenizers are available: `word_v2`, `word_v1` (default), `word_v0`, and `pre_tokenized_array`.
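The tokenizer is chosen per attribute in the schema; as a brief, hedged sketch (attribute name hypothetical):

```python
# Illustrative sketch: select a specific tokenizer for an attribute.
schema = {
    "content": {
        "type": "string",
        "full_text_search": {"tokenizer": "word_v2"},
    },
}
```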
The `word_v2` tokenizer forms tokens from ideographic codepoints, contiguous sequences of alphanumeric codepoints, and sequences of emoji codepoints that form a single glyph. Codepoints that are not alphanumeric, ideographic, or emoji are discarded. Codepoints are classified according to Unicode v16.0.
The `word_v1` tokenizer works like the `word_v2` tokenizer, except that ideographic codepoints are treated as alphanumeric codepoints. Codepoints are classified according to Unicode v10.0.
The `word_v0` tokenizer works like the `word_v1` tokenizer, except that emoji codepoints are discarded.
The `pre_tokenized_array` tokenizer is a special tokenizer that indicates that you want to perform your own tokenization. This tokenizer can only be used on attributes of type `[]string`; each string in the array is interpreted as a token. When this tokenizer is active, queries using the `BM25` or `ContainsAllTokens` operators must supply a query operand of type `[]string` rather than `string`; each string in the array is interpreted as a token. Tokens are always matched case-sensitively, without stemming or stopword removal. You cannot specify `language`, `stemming: true`, `remove_stopwords: true`, or `case_sensitive: false` when using this tokenizer.
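Because the query operand must itself be a list of tokens, a query against such an attribute might look like the sketch below. It reuses the hypothetical `content_tokens` attribute from earlier; the payload layout and the `filters` key are assumptions to be checked against the API reference.

```python
# Illustrative sketch: querying a pre_tokenized_array attribute.
# BM25 and ContainsAllTokens operands are arrays of tokens, not strings.
query = {
    "rank_by": ["content_tokens", "BM25", ["walruses", "clams"]],
    "filters": ["content_tokens", "ContainsAllTokens", ["clams"]],
    "top_k": 10,
}
```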
Other tokenizers can be supported by contacting us.
The BM25 scoring algorithm involves two parameters that can be tuned for your workload:
`k1` controls how quickly the impact of term frequency saturates. When `k1` is close to zero, term frequency is effectively ignored when scoring a document. When `k1` is close to infinity, term frequency contributes nearly linearly to the score.

The default value, `1.2`, means that increasing term frequency in a document boosts the score heavily at first but quickly yields diminishing returns.
`b` controls document length normalization. When `b` is `0.0`, documents are treated equally regardless of length, which lets long documents dominate through sheer volume of terms. When `b` is `1.0`, documents are boosted or penalized based on the ratio of their length to the average document length in the corpus.

The default value, `0.75`, controls for length bias without eliminating it entirely (long documents are often legitimately more relevant).
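To make the roles of `k1` and `b` concrete, here is the textbook Okapi BM25 per-term contribution written out in Python. This is the standard formula, not necessarily turbopuffer's exact implementation, and is included only to show how `k1` saturates term frequency and how `b` scales length normalization.

```python
def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    idf: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook Okapi BM25 contribution of one query term to one document.

    tf          -- occurrences of the term in the document
    doc_len     -- number of tokens in the document
    avg_doc_len -- average document length across the corpus
    idf         -- inverse document frequency of the term
    """
    # b interpolates between no length normalization (b = 0) and full
    # normalization by doc_len / avg_doc_len (b = 1).
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # As tf grows, the fraction approaches k1 + 1, so repeated occurrences
    # of a term saturate; larger k1 delays that saturation.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)
```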
The default values are suitable for most applications. Tuning is typically required only if your corpus consists of extremely short texts like tweets (decrease `k1` and `b`) or extremely long texts like legal documents (increase `k1` and `b`).
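As a hedged sketch of how such tuning might be expressed in the schema (attribute name and values are illustrative, not recommendations):

```python
# Illustrative sketch: tune BM25 parameters for a corpus of short texts.
schema = {
    "content": {
        "type": "string",
        "full_text_search": {
            "k1": 0.9,  # below the 1.2 default: saturate term frequency sooner
            "b": 0.4,   # below the 0.75 default: weaker length normalization
        },
    },
}
```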
To tune these parameters, we recommend an empirical approach: build a set of evals, and choose the parameter values that maximize performance on those evals.