Full-Text Search Guide
turbopuffer supports BM25 full-text search for string and []string types. This guide shows how to configure and use full-text search with different options.
turbopuffer's full-text search engine was written from the ground up for the turbopuffer storage engine, to provide low-latency searches directly on object storage.
For hybrid search combining both vector and BM25 results, see Hybrid Search.
For all available full-text search options, see the Schema documentation.
Basic example
The simplest form of full-text search is on a single field of type string.
# $ pip install turbopuffer
import turbopuffer
import os

tpuf = turbopuffer.Turbopuffer(
    # API tokens are created in the dashboard: https://turbopuffer.com/dashboard
    api_key=os.getenv("TURBOPUFFER_API_KEY"),
    region="gcp-us-central1",  # choose best region: https://turbopuffer.com/docs/regions
)

ns = tpuf.namespace('fts-basic-example-py')

ns.write(
    upsert_rows=[
        {
            'id': 1,
            'content': 'turbopuffer is a fast search engine with FTS, filtering, and vector search support'
        },
        {
            'id': 2,
            'content': 'turbopuffer can store billions and billions of documents cheaper than any other search engine'
        },
        {
            'id': 3,
            'content': 'turbopuffer will support many more types of queries as it evolves. turbopuffer will only get faster.'
        }
    ],
    schema={
        'content': {
            'type': 'string',
            # Enable BM25 with default settings
            # For all config options, see https://turbopuffer.com/docs/write#schema
            'full_text_search': True
        }
    }
)

# Basic FTS search.
results = ns.query(
    rank_by=('content', 'BM25', 'turbopuffer'),
    limit=10,
    include_attributes=['content']
)
# [3, 1, 2] is the default BM25 ranking based on document length and
# term frequency
print(results)
# Simple token filter, to limit results to documents that contain both the
# terms "search" and "engine" (in any order, not necessarily adjacent)
results = ns.query(
    rank_by=('content', 'BM25', 'turbopuffer'),
    filters=('content', 'ContainsAllTokens', 'search engine'),
    limit=10,
    include_attributes=['content']
)
# [1, 2] (same as above, but without document 3)
print(results)

# To combine with vector search, see:
# https://turbopuffer.com/docs/hybrid-search
Advanced example
You can use full-text search operators like Sum and Product to perform a full-text search across multiple attributes simultaneously.
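To make the arithmetic concrete, the sketch below combines per-field scores the way the Sum and Product operators are described here: each Product scales one attribute's BM25 score by a constant, and Sum adds the results. The per-field scores are made-up numbers chosen for illustration, and `combine` is a hypothetical helper, not part of the turbopuffer client.

```python
from typing import Dict


def combine(field_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-field scores; fields without a weight count once."""
    return sum(weights.get(field, 1.0) * score
               for field, score in field_scores.items())


# Hypothetical per-field BM25 scores for two documents (illustrative values only).
doc_a = {'title': 0.5, 'tags': 0.25, 'content': 1.0}
doc_b = {'title': 0.0, 'tags': 0.0, 'content': 2.0}

# Mirrors Sum(Product(3, title), Product(2, tags), content).
weights = {'title': 3.0, 'tags': 2.0}

score_a = combine(doc_a, weights)  # 3 * 0.5 + 2 * 0.25 + 1.0
score_b = combine(doc_b, weights)  # 0 + 0 + 2.0
print(score_a, score_b)
```

Although doc_b has the stronger `content` match in this made-up data, the boosted `title` and `tags` hits put doc_a first.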
import turbopuffer

tpuf = turbopuffer.Turbopuffer(
    region='gcp-us-central1',  # choose best region: https://turbopuffer.com/docs/regions
)

ns = tpuf.namespace('fts-advanced-example-py')

# Write some documents with a rich set of attributes.
ns.write(
    upsert_rows=[
        {
            'id': 1,
            'title': 'Getting Started with Python',
            'content': 'Learn Python basics including variables, functions, and classes',
            'tags': ['python', 'programming', 'beginner'],
            'language': 'en',
            'publish_date': 1709251200
        },
        {
            'id': 2,
            'title': 'Advanced TypeScript Tips',
            'content': 'Discover advanced TypeScript features and type system tricks',
            'tags': ['typescript', 'javascript', 'advanced'],
            'language': 'en',
            'publish_date': 1709337600
        },
        {
            'id': 3,
            'title': 'Python vs JavaScript',
            'content': 'Compare Python and JavaScript for web development',
            'tags': ['python', 'javascript', 'comparison'],
            'language': 'en',
            'publish_date': 1709424000
        }
    ],
    schema={
        'title': {
            'type': 'string',
            'full_text_search': {
                # See all FTS indexing options at
                # https://turbopuffer.com/docs/write#param-full_text_search
                'language': 'english',
                'stemming': True,
                'remove_stopwords': True,
                'case_sensitive': False
            }
        },
        'content': {
            'type': 'string',
            'full_text_search': {
                'language': 'english',
                'stemming': True,
                'remove_stopwords': True
            }
        },
        'tags': {
            'type': '[]string',
            'full_text_search': {
                'stemming': False,
                'remove_stopwords': False,
                'case_sensitive': True
            }
        }
    }
)

# Advanced FTS search.
# In this example, hits on `title` and `tags` are weighted / boosted higher than
# hits on `content`.
result = ns.query(
    # See all FTS query options at https://turbopuffer.com/docs/query
    rank_by=('Sum', (
        ('Product', 3, ('title', 'BM25', 'python beginner')),
        ('Product', 2, ('tags', 'BM25', 'python beginner')),
        ('content', 'BM25', 'python beginner')
    )),
    filters=('And', (
        ('publish_date', 'Gte', 1709251200),
        ('language', 'Eq', 'en'),
    )),
    limit=10,
    include_attributes=['title', 'content', 'tags']
)
print(result.rows)

# To combine with vector search, see:
# https://turbopuffer.com/docs/hybrid-search
Custom tokenization
When turbopuffer's built-in tokenizers aren't sufficient, use the pre_tokenized_array tokenizer to perform client-side tokenization using arbitrary logic.
import turbopuffer
from typing import List

tpuf = turbopuffer.Turbopuffer(
    region='gcp-us-central1',  # choose best region: https://turbopuffer.com/docs/regions
)


# A simple word tokenizer that preserves hyphens instead of splitting on them.
def tokenize(text: str) -> List[str]:
    # Replace all characters besides alphanumeric and '-' with spaces.
    cleaned = ""
    for ch in text:
        if ch.isalnum() or ch == "-":
            cleaned += ch
        else:
            cleaned += " "
    # Lowercase and split on spaces.
    return cleaned.lower().split()


# Write some sample data.
ns = tpuf.namespace('fts-custom-tokenization-example-py')
ns.write(
    upsert_rows=[
        {"id": 1, "content": tokenize("We hold these truths to be self-evident.")},
        {"id": 2, "content": tokenize("For my own self, it seemed evident.")},
    ],
    schema={
        'content': {
            'type': '[]string',
            'full_text_search': {'tokenizer': 'pre_tokenized_array'}
        }
    }
)

# Query for "self" and "evident".
results = ns.query(
    # Notice that the BM25 operator now expects a list of tokens, not a string.
    rank_by=('content', 'BM25', ['self', 'evident']),
    limit=10,
)
# Only document 2 is matched, because document 1 has the token "self-evident"
# but neither the token "self" nor "evident".
print(results)

# Query for "self-evident".
results = ns.query(
    rank_by=('content', 'BM25', ['self-evident']),
    limit=10,
)
# Now only document 1 is matched.
print(results)


# To accept string queries, simply apply the tokenizer to the query string
# before passing it to the `BM25` operator.
def query_string(query: str):
    return ns.query(
        rank_by=('content', 'BM25', tokenize(query)),
        limit=10,
    )
Supported languages
turbopuffer currently supports language-aware stemming and stopword removal for full-text search. The following languages are supported:
For Latin-script languages with diacritics (e.g. french, spanish), consider enabling ascii_folding in your BM25 params.
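As a rough illustration of what ASCII folding means, the sketch below strips diacritics using Python's standard `unicodedata` module. It approximates the idea only; turbopuffer's own folding rules may differ, and `ascii_fold` is a hypothetical helper rather than part of the client.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Decompose characters (NFKD), then drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_fold('caf\u00e9'))     # 'café' with its accent removed
print(ascii_fold('espa\u00f1ol'))  # 'español' with its tilde removed
```

With folding of this kind applied to both documents and queries, a query typed without accents can still match accented text.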
Other languages can be supported by contacting us.
Tokenizers
- word_v4 (default for new namespaces)
- word_v3
- word_v2
- word_v1
- word_v0
- pre_tokenized_array
The default tokenizer is periodically upgraded. If your application relies on specific tokenization behavior, you should explicitly specify a tokenizer in the schema.
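For instance, a schema can name the tokenizer explicitly instead of relying on the default. The sketch below only builds the schema dictionary; the attribute name `content` is illustrative, and the `tokenizer` key follows the form used in the custom tokenization example earlier in this guide.

```python
# Schema fragment that pins the tokenizer rather than relying on the default.
schema = {
    'content': {
        'type': 'string',
        'full_text_search': {
            'tokenizer': 'word_v3',  # any tokenizer name from the list above
        },
    },
}

print(schema['content']['full_text_search']['tokenizer'])
```

A dictionary like this would be passed as the `schema` argument of `ns.write`, as in the examples above.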
The word_v4 and word_v3 tokenizers use Unicode v17.0 text segmentation rules (UAX #29) for accurate segmentation across most languages, scripts, and emojis. word_v4 is the current default for new namespaces; it behaves like word_v3, but is roughly 3x faster and fixes a few tokenization edge cases. It's powered by our open-source alyze library.
Each token is assigned a position. Every word-like token consumes a position even when a filter (length, stopword) drops it, so positions can have gaps; this keeps phrase distances accurate.
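The sketch below illustrates position gaps with a deliberately simplified tokenizer: it splits on whitespace instead of applying UAX #29, the stopword set is a made-up example, and numbering positions from 0 is an arbitrary choice. `tokenize_with_positions` is a hypothetical helper.

```python
from typing import List, Set, Tuple

STOPWORDS: Set[str] = {'the', 'a', 'of'}  # illustrative only


def tokenize_with_positions(
    text: str, stopwords: Set[str] = STOPWORDS
) -> List[Tuple[str, int]]:
    """Number every word first, then drop stopwords without renumbering."""
    tokens = []
    for position, word in enumerate(text.lower().split()):
        if word not in stopwords:
            tokens.append((word, position))
    return tokens


print(tokenize_with_positions('king of the hill'))
# [('king', 0), ('hill', 3)]
```

Because "king" and "hill" keep positions 0 and 3, a phrase query would not treat them as adjacent even though the two words between them were dropped.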
The word_v2 tokenizer forms tokens from ideographic codepoints, contiguous
sequences of alphanumeric codepoints, and sequences of emoji codepoints that
form a single glyph. Codepoints that are not alphanumeric, ideographic, or an
emoji are discarded. Codepoints are classified according to Unicode v16.0.
The word_v1 tokenizer works like the word_v2 tokenizer, except that
ideographic codepoints are treated as alphanumeric codepoints. Codepoints are
classified according to Unicode v10.0.
The word_v0 tokenizer works like the word_v1 tokenizer, except that emoji
codepoints are discarded.
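A rough approximation of the behavior described for word_v0 can be written with Python's `str.isalnum`: contiguous alphanumeric runs become tokens, and everything else, including emoji, is discarded. This is a sketch only. Python's Unicode tables are not the Unicode v10.0 data mentioned above, no lowercasing or stemming is applied, and `simple_word_tokenize` is a hypothetical name.

```python
from typing import List


def simple_word_tokenize(text: str) -> List[str]:
    """Split text into runs of alphanumeric codepoints, discarding the rest."""
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            # A non-alphanumeric codepoint ends the current token.
            tokens.append(''.join(current))
            current = []
    if current:
        tokens.append(''.join(current))
    return tokens


print(simple_word_tokenize('We hold these truths to be self-evident.'))
# ['We', 'hold', 'these', 'truths', 'to', 'be', 'self', 'evident']
```

Note how the hyphen splits "self-evident" into two tokens, which is the behavior the custom tokenization example works around.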
The pre_tokenized_array tokenizer is a special tokenizer that indicates that
you want to perform your own tokenization. This tokenizer can only be used on
attributes of type []string; each string in the array is interpreted as a
token. When this tokenizer is active, queries using the BM25 or
ContainsAllTokens operators must supply a query operand of type []string
rather than string; each string in the array is interpreted as a token. Tokens
are always matched case sensitively, without stemming or stopword removal. You
cannot specify language, stemming: true, remove_stopwords: true, or
case_sensitive: false when using this tokenizer.
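The matching rule for pre-tokenized attributes can be pictured as exact, case-sensitive membership. The sketch below is a plain-Python illustration of that rule under this reading of the text, not the engine's implementation; `contains_all_tokens` is a hypothetical helper.

```python
from typing import List


def contains_all_tokens(doc_tokens: List[str], query_tokens: List[str]) -> bool:
    """True when every query token appears verbatim among the document tokens."""
    return set(query_tokens) <= set(doc_tokens)


doc = ['we', 'hold', 'these', 'truths', 'to', 'be', 'self-evident']

print(contains_all_tokens(doc, ['self-evident']))     # True
print(contains_all_tokens(doc, ['self', 'evident']))  # False
print(contains_all_tokens(doc, ['Truths']))           # False: case matters
```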
New tokenizers can be requested by contacting us.
Fuzzy matching
turbopuffer supports fuzzy string matching within a specified edit distance via the Fuzzy filter. Fuzzy filters require the fuzzy schema parameter to be set to true on the queried attribute.
The max_edit_distance parameter sets the maximum number of edits allowed for a match, as a function of the query string's length in characters. For example:
"max_edit_distance": [
    # Queries >= 6 characters match on substrings within 1 edit
    # Queries >= 9 characters match on substrings within 2 edits
    # Queries < 6 characters match nothing
    { "min_query_chars": 6, "distance": 1 },
    { "min_query_chars": 9, "distance": 2 }
],
A missing or added character, incorrect character, missing or added diacritic (e.g. ü), or case difference will add 1 to the edit distance by default. If the case_sensitive parameter is set to false, case differences do not count toward the edit distance.
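The rules above can be sketched in plain Python. The sketch assumes edits are counted as Levenshtein operations (insert, delete, substitute), that a match is against the closest substring of the value, and that the largest `min_query_chars` threshold the query meets decides the allowed distance. All three function names are hypothetical, and this is not turbopuffer's implementation.

```python
from typing import Dict, List, Optional


def substring_edit_distance(query: str, text: str, case_sensitive: bool = True) -> int:
    """Smallest edit distance between query and any substring of text."""
    if not case_sensitive:
        query, text = query.lower(), text.lower()
    # Row for the empty query: a match may start anywhere in text at no cost.
    prev = [0] * (len(text) + 1)
    for i, qc in enumerate(query, start=1):
        cur = [i]
        for j, tc in enumerate(text, start=1):
            cost = 0 if qc == tc else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    # A match may also end anywhere in text.
    return min(prev)


def allowed_distance(query: str, thresholds: List[Dict[str, int]]) -> Optional[int]:
    """Distance of the largest threshold the query length meets, else None."""
    best = None
    for rule in sorted(thresholds, key=lambda r: r['min_query_chars']):
        if len(query) >= rule['min_query_chars']:
            best = rule['distance']
    return best


def fuzzy_match(query: str, text: str, thresholds: List[Dict[str, int]],
                case_sensitive: bool = True) -> bool:
    limit = allowed_distance(query, thresholds)
    if limit is None:
        return False  # query too short for any threshold
    return substring_edit_distance(query, text, case_sensitive) <= limit


thresholds = [
    {'min_query_chars': 6, 'distance': 1},
    {'min_query_chars': 9, 'distance': 2},
]
print(fuzzy_match('turbopufer', 'turbopuffer', thresholds))  # True
print(fuzzy_match('turbo', 'turbopuffer', thresholds))       # False
```

The second call returns False even though "turbo" is an exact substring, because a 5-character query falls below every threshold.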
Fuzzy matching can be used as a filter directly, or within a rank_by expression as a Rank by filter, possibly in conjunction with other expressions:
# $ pip install turbopuffer
import turbopuffer

tpuf = turbopuffer.Turbopuffer(
    region='gcp-us-central1',  # choose best region: https://turbopuffer.com/docs/regions
)

ns = tpuf.namespace('fts-fuzzy-example-py')

ns.write(
    upsert_rows=[
        {'id': 1, 'company_name': 'turbopuffer'},
        {'id': 2, 'company_name': 'turbopufer inc'},
    ],
    schema={
        'company_name': {
            'type': 'string',
            'fuzzy': True,
            'glob': True,
        },
    },
)

result = ns.query(
    rank_by=('Sum', (
        ('Product', 3, ('company_name', 'Glob', '*turbopufer*')),
        ('company_name', 'Fuzzy', 'turbopufer', {
            'max_edit_distance': [
                {'min_query_chars': 6, 'distance': 1},
            ],
            'case_sensitive': False
        }),
    )),
    include_attributes=['company_name'],
    limit=10
)
print(result.rows)
This query will prioritize values that contain exactly "turbopufer" as a substring, while simultaneously ensuring that values that contain a substring within 1 edit are returned (since the query has >= 6 characters).
Advanced tuning
The BM25 scoring algorithm involves three parameters that can be tuned for your workload:
- k1 controls how quickly the impact of term frequency saturates. When k1 is close to zero, term frequency is effectively ignored when scoring a document. When k1 is close to infinity, term frequency contributes nearly linearly to the score. The default value, 1.2, means that increasing term frequency in a document boosts the score heavily at first but quickly results in diminishing returns.
- b controls document length normalization. When b is 0.0, documents are treated equally regardless of length, which allows long articles to dominate due to sheer volume of terms. When b is 1.0, documents are boosted or penalized based on the ratio of their length to the average document length in the corpus. The default value, 0.75, controls for length bias without eliminating it entirely (long documents are often legitimately more relevant).
- k3 controls the saturation point for query term frequency. When a query contains repeated terms, k3 determines how much additional weight each repetition contributes. When k3 is close to zero, query term repetition is effectively ignored. When k3 is large, repeated query terms contribute nearly linearly to the score. The default value, 8.0, allows repeated query terms to have a meaningful impact on scoring while still applying diminishing returns.
The default values are suitable for most applications. Tuning k1 and b is typically
required only if your corpus consists of extremely short texts like tweets
(decrease k1 and b) or extremely long texts like legal documents (increase
k1 and b).
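To see how the three parameters interact, the sketch below implements a textbook Okapi BM25 score with a Lucene-style non-negative IDF and a simple regex tokenizer. These choices are assumptions made for illustration; turbopuffer's exact formula, tokenization, and stopword handling may differ, and `bm25_scores` is a hypothetical helper.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r'\w+', text.lower())


def bm25_scores(docs: Dict[int, str], query: str,
                k1: float = 1.2, b: float = 0.75, k3: float = 8.0) -> Dict[int, float]:
    doc_tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(doc_tokens)
    avg_len = sum(len(t) for t in doc_tokens.values()) / n_docs
    query_tf = Counter(tokenize(query))
    scores = {}
    for doc_id, tokens in doc_tokens.items():
        tf = Counter(tokens)
        # b blends between no length normalization (0.0) and full (1.0).
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term, qtf in query_tf.items():
            if tf[term] == 0:
                continue
            df = sum(1 for t in doc_tokens.values() if term in t)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # k1 controls how fast document term frequency saturates.
            doc_part = tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
            # k3 controls how fast query term frequency saturates.
            query_part = qtf * (k3 + 1) / (qtf + k3)
            score += idf * doc_part * query_part
        scores[doc_id] = score
    return scores


docs = {
    1: 'turbopuffer is a fast search engine with FTS, filtering, and vector search support',
    2: 'turbopuffer can store billions and billions of documents cheaper than any other search engine',
    3: 'turbopuffer will support many more types of queries as it evolves. turbopuffer will only get faster.',
}
scores = bm25_scores(docs, 'turbopuffer')
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Under these assumptions, document 3 ranks first because it contains the query term twice, and document 1 edges out document 2 because it is slightly shorter. Setting `k1=0.0` makes all three scores equal, and setting `b=0.0` makes documents 1 and 2 tie.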
To tune these parameters, we recommend an empirical approach: build a set of evals, and choose the parameter values that maximize performance on those evals.
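One minimal way to organize such an experiment is a grid search over candidate values. In the sketch below, `evaluate` stands in for whatever eval harness you build (for example, one that re-indexes with the given parameters and returns a quality metric); it and `grid_search` are hypothetical, and the stand-in metric at the bottom is synthetic.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def grid_search(
    evaluate: Callable[[float, float], float],
    k1_values: Iterable[float],
    b_values: Iterable[float],
) -> Tuple[float, float, float]:
    """Return (best_k1, best_b, best_score) over all (k1, b) combinations."""
    best_score, best_k1, best_b = max(
        (evaluate(k1, b), k1, b) for k1, b in product(k1_values, b_values)
    )
    return best_k1, best_b, best_score


# Synthetic stand-in for a real eval: peaks at k1=0.9, b=0.4.
def fake_eval(k1: float, b: float) -> float:
    return -((k1 - 0.9) ** 2 + (b - 0.4) ** 2)


best_k1, best_b, best_score = grid_search(
    fake_eval, k1_values=[0.6, 0.9, 1.2], b_values=[0.2, 0.4, 0.75]
)
print(best_k1, best_b)
```

In practice the metric would come from labeled queries, and holding out part of the eval set helps avoid overfitting the parameters to it.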