Full-Text Search Guide

turbopuffer supports BM25 full-text search for string and []string types. This guide shows how to configure and use full-text search with different options.

turbopuffer's full-text search engine was written from the ground up for the turbopuffer storage engine, enabling low-latency searches directly on object storage.

For hybrid search combining both vector and BM25 results, see Hybrid Search.

For all available full-text search options, see the Schema documentation.

Basic example

The simplest form of full-text search is on a single field of type string.

# $ pip install turbopuffer
import turbopuffer
import os

tpuf = turbopuffer.Turbopuffer(
    # API tokens are created in the dashboard: https://turbopuffer.com/dashboard
    api_key=os.getenv("TURBOPUFFER_API_KEY"),
    region="gcp-us-central1", # choose best region: https://turbopuffer.com/docs/regions
)

ns = tpuf.namespace('fts-basic-example-py')
ns.write(
    upsert_rows=[
        {
            'id': 1,
            'content': 'turbopuffer is a fast search engine with FTS, filtering, and vector search support'
        },
        {
            'id': 2,
            'content': 'turbopuffer can store billions and billions of documents cheaper than any other search engine'
        },
        {
            'id': 3,
            'content': 'turbopuffer will support many more types of queries as it evolves. turbopuffer will only get faster.'
        }
    ],
    schema={
        'content': {
            'type': 'string',
            # Enable BM25 with default settings
            # For all config options, see https://turbopuffer.com/docs/write#schema
            'full_text_search': True
        }
    }
)

# Basic FTS search.
results = ns.query(
    rank_by=('content', 'BM25', 'turbopuffer'),
    limit=10,
    include_attributes=['content']
)
# [3, 1, 2] is the default BM25 ranking based on document length and
# term frequency
print(results)

# Simple token filter, to limit results to documents that contain both the
# terms "search" and "engine"
results = ns.query(
    rank_by=('content', 'BM25', 'turbopuffer'),
    filters=('content', 'ContainsAllTokens', 'search engine'),
    limit=10,
    include_attributes=['content']
)
# [1, 2] (same as above, but without document 3)
print(results)

# To combine with vector search, see:
# https://turbopuffer.com/docs/hybrid-search

Advanced example

You can use full-text search operators like Sum and Product to perform a full-text search across multiple attributes simultaneously.

import turbopuffer

tpuf = turbopuffer.Turbopuffer(
    region='gcp-us-central1', # choose best region: https://turbopuffer.com/docs/regions
)

ns = tpuf.namespace('fts-advanced-example-py')

# Write some documents with a rich set of attributes.
ns.write(
    upsert_rows=[
        {
            'id': 1,
            'title': 'Getting Started with Python',
            'content': 'Learn Python basics including variables, functions, and classes',
            'tags': ['python', 'programming', 'beginner'],
            'language': 'en',
            'publish_date': 1709251200
        },
        {
            'id': 2,
            'title': 'Advanced TypeScript Tips',
            'content': 'Discover advanced TypeScript features and type system tricks',
            'tags': ['typescript', 'javascript', 'advanced'],
            'language': 'en',
            'publish_date': 1709337600
        },
        {
            'id': 3,
            'title': 'Python vs JavaScript',
            'content': 'Compare Python and JavaScript for web development',
            'tags': ['python', 'javascript', 'comparison'],
            'language': 'en',
            'publish_date': 1709424000
        }
    ],
    schema={
        'title': {
            'type': 'string',
            'full_text_search': {
                # See all FTS indexing options at
                # https://turbopuffer.com/docs/write#param-full_text_search
                'language': 'english',
                'stemming': True,
                'remove_stopwords': True,
                'case_sensitive': False
            }
        },
        'content': {
            'type': 'string',
            'full_text_search': {
                'language': 'english',
                'stemming': True,
                'remove_stopwords': True
            }
        },
        'tags': {
            'type': '[]string',
            'full_text_search': {
                'stemming': False,
                'remove_stopwords': False,
                'case_sensitive': True
            }
        }
    }
)

# Advanced FTS search.
# In this example, hits on `title` and `tags` are weighted / boosted higher than
# hits on `content`.
result = ns.query(
    # See all FTS query options at https://turbopuffer.com/docs/query
    rank_by=('Sum', (
        ('Product', 3, ('title', 'BM25', 'python beginner')),
        ('Product', 2, ('tags', 'BM25', 'python beginner')),
        ('content', 'BM25', 'python beginner')
    )),
    filters=('And', (
        ('publish_date', 'Gte', 1709251200),
        ('language', 'Eq', 'en'),
    )),
    limit=10,
    include_attributes=['title', 'content', 'tags']
)
print(result.rows)

# To combine with vector search, see:
# https://turbopuffer.com/docs/hybrid-search

Custom tokenization

When turbopuffer's built-in tokenizers aren't sufficient, use the pre_tokenized_array tokenizer to perform client-side tokenization using arbitrary logic.

import turbopuffer
from typing import List

tpuf = turbopuffer.Turbopuffer(
    region='gcp-us-central1', # choose best region: https://turbopuffer.com/docs/regions
)

# A simple word tokenizer that preserves hyphens instead of splitting on them.
def tokenize(text: str) -> List[str]:
    # Replace all characters besides alphanumeric and '-' with spaces.
    cleaned = ""
    for ch in text:
        if ch.isalnum() or ch == "-":
            cleaned += ch
        else:
            cleaned += " "
    # Lowercase and split on spaces.
    return cleaned.lower().split()

# Write some sample data.
ns = tpuf.namespace('fts-custom-tokenization-example-py')
ns.write(
    upsert_rows=[
        {"id": 1, "content": tokenize("We hold these truths to be self-evident.")},
        {"id": 2, "content": tokenize("For my own self, it seemed evident.")},
    ],
    schema={
        'content': {
            'type': '[]string',
            'full_text_search': {'tokenizer': 'pre_tokenized_array'}
        }
    }
)

# Query for "self" and "evident".
results = ns.query(
    # Notice that the BM25 operator now expects a list of tokens, not a string.
    rank_by=('content', 'BM25', ['self', 'evident']),
    limit=10,
)
# Only document 2 is matched, because document 1 has the token "self-evident"
# but neither the token "self" nor "evident".
print(results)

# Query for "self-evident".
results = ns.query(
    rank_by=('content', 'BM25', ['self-evident']),
    limit=10,
)
# Now only document 1 is matched.
print(results)

# To accept string queries, simply apply the tokenizer to the query string
# before passing it to the `BM25` operator.
def query_string(query: str):
    return ns.query(
        rank_by=('content', 'BM25', tokenize(query)),
        limit=10,
    )

Supported languages

turbopuffer currently supports language-aware stemming and stopword removal for full-text search. The following languages are supported:

  • arabic
  • danish
  • dutch
  • english (default)
  • finnish
  • french
  • german
  • greek
  • hungarian
  • italian
  • norwegian
  • portuguese
  • romanian
  • russian
  • spanish
  • swedish
  • tamil
  • turkish

For Latin-script languages with diacritics (e.g. french, spanish), consider enabling ascii_folding in your BM25 params.
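
To see what ASCII folding does to a token, here is a minimal sketch using Python's standard unicodedata module. It illustrates the general technique (decompose each character, then drop the combining marks) and is not turbopuffer's implementation; the helper name fold_ascii is hypothetical.

```python
import unicodedata

def fold_ascii(text: str) -> str:
    # Decompose characters, e.g. "é" becomes "e" + a combining acute accent ...
    decomposed = unicodedata.normalize("NFKD", text)
    # ... then drop the combining marks, keeping only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_ascii("caf\u00e9"))     # cafe
print(fold_ascii("espa\u00f1ol"))  # espanol
```

With folding applied to both documents and queries, a query typed without accents can still match accented text.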

Other languages can be supported by contacting us.

Tokenizers

  • word_v4 (default for new namespaces)
  • word_v3
  • word_v2
  • word_v1
  • word_v0
  • pre_tokenized_array

The default tokenizer is periodically upgraded. If your application relies on specific tokenization behavior, you should explicitly specify a tokenizer in the schema.
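
Pinning a tokenizer might look like the schema fragment below. The tokenizer key is the one used in the custom tokenization example; that it accepts word_v3 in the same position is an assumption based on the list of tokenizers above.

```python
# Schema fragment to pass as `schema=` to ns.write(...).
# Assumption: `tokenizer` accepts any name from the list of tokenizers.
schema = {
    'content': {
        'type': 'string',
        'full_text_search': {
            # Pinned, so upgrades to the default don't change tokenization.
            'tokenizer': 'word_v3',
        },
    },
}
```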

The word_v4 and word_v3 tokenizers use Unicode v17.0 text segmentation rules (UAX #29) for accurate segmentation across most languages, scripts, and emojis. word_v4 is the current default for new namespaces; it behaves like word_v3, but is roughly 3x faster and fixes a few tokenization edge cases. It's powered by our open-source alyze library.

Each token is assigned a position. Every word-like token consumes a position even when a filter (length, stopword) drops it, so positions can have gaps; this keeps phrase distances accurate.
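
The effect of position gaps can be illustrated with a toy tokenizer. This is a sketch only: the regex splitting, the tiny STOPWORDS set, and the function name are all hypothetical and far simpler than the real word_v4 tokenizer.

```python
import re

STOPWORDS = {"the", "of", "a"}  # hypothetical, tiny stopword list

def tokenize_with_positions(text: str, max_token_length: int = 255):
    tokens = []
    # Every word-like token gets a position, even if it is dropped below.
    for position, word in enumerate(re.findall(r"\w+", text.lower())):
        if word in STOPWORDS or len(word) > max_token_length:
            continue  # dropped, but its position is still consumed
        tokens.append((word, position))
    return tokens

print(tokenize_with_positions("The speed of the search"))
# [('speed', 1), ('search', 4)]
```

Here "speed" and "search" remain three positions apart, just as they are in the original text, even though the stopwords between them were removed.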

The word_v2 tokenizer forms tokens from ideographic codepoints, contiguous sequences of alphanumeric codepoints, and sequences of emoji codepoints that form a single glyph. Codepoints that are not alphanumeric, ideographic, or an emoji are discarded. Codepoints are classified according to Unicode v16.0.

The word_v1 tokenizer works like the word_v2 tokenizer, except that ideographic codepoints are treated as alphanumeric codepoints. Codepoints are classified according to Unicode v10.0.

The word_v0 tokenizer works like the word_v1 tokenizer, except that emoji codepoints are discarded.

The pre_tokenized_array tokenizer is a special tokenizer that indicates that you want to perform your own tokenization. This tokenizer can only be used on attributes of type []string; each string in the array is interpreted as a token. When this tokenizer is active, queries using the BM25 or ContainsAllTokens operators must supply a query operand of type []string rather than string; each string in the array is interpreted as a token. Tokens are always matched case sensitively, without stemming or stopword removal. You cannot specify language, stemming: true, remove_stopwords: true, or case_sensitive: false when using this tokenizer.

New tokenizers can be requested by contacting us.

Fuzzy matching

turbopuffer supports fuzzy string matching within a specified edit distance via the Fuzzy filter. Fuzzy filters require the fuzzy schema parameter to be set to true on the queried attribute.

The max_edit_distance parameter sets the maximum number of edits allowed for a match, based on the number of characters in the query string. For example:

"max_edit_distance": [
  # Queries >= 6 characters match on substrings within 1 edit
  # Queries >= 9 characters match on substrings within 2 edits
  # Queries < 6 characters match nothing
  { "min_query_chars": 6, "distance": 1 },
  { "min_query_chars": 9, "distance": 2 }
],
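
One way to read these tiers: the entry with the largest min_query_chars that the query length satisfies determines the allowed distance. The sketch below assumes that resolution rule (inferred from the comments in the example) and uses a hypothetical helper name, allowed_distance.

```python
from typing import Optional

MAX_EDIT_DISTANCE = [
    {"min_query_chars": 6, "distance": 1},
    {"min_query_chars": 9, "distance": 2},
]

def allowed_distance(query: str, tiers=MAX_EDIT_DISTANCE) -> Optional[int]:
    # Keep the tiers whose minimum length the query satisfies.
    matching = [t for t in tiers if len(query) >= t["min_query_chars"]]
    if not matching:
        return None  # query too short: matches nothing
    # The tier with the largest minimum length wins.
    return max(matching, key=lambda t: t["min_query_chars"])["distance"]

print(allowed_distance("puff"))         # None
print(allowed_distance("puffer"))       # 1
print(allowed_distance("turbopuffer"))  # 2
```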

A missing or added character, incorrect character, missing or added diacritic (e.g. ü), or case difference will add 1 to the edit distance by default. If the case_sensitive parameter is set to false, case differences do not count toward the edit distance.
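
These rules correspond to the classic Levenshtein distance computed over characters. The sketch below is a textbook dynamic-programming version, not turbopuffer's implementation: it compares whole strings (the Fuzzy filter matches substrings), and it assumes case-insensitive matching behaves like lowercasing both sides.

```python
def edit_distance(a: str, b: str, case_sensitive: bool = True) -> int:
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("turbopufer", "turbopuffer"))             # 1 (missing character)
print(edit_distance("uber", "\u00fcber"))                     # 1 (missing diacritic)
print(edit_distance("Turbo", "turbo"))                        # 1 (case difference)
print(edit_distance("Turbo", "turbo", case_sensitive=False))  # 0
```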

Fuzzy matching can be used as a filter directly, or within a rank_by expression as a Rank by filter, possibly in conjunction with other expressions:

# $ pip install turbopuffer
import turbopuffer

tpuf = turbopuffer.Turbopuffer(
    region='gcp-us-central1', # choose best region: https://turbopuffer.com/docs/regions
)
ns = tpuf.namespace('fts-fuzzy-example-py')
ns.write(
    upsert_rows=[
        {'id': 1, 'company_name': 'turbopuffer'},
        {'id': 2, 'company_name': 'turbopufer inc'},
    ],
    schema={
        'company_name': {
            'type': 'string',
            'fuzzy': True,
            'glob': True,
        },
    },
)
result = ns.query(
    rank_by=('Sum', (
        ('Product', 3, ('company_name', 'Glob', '*turbopufer*')),
        ('company_name', 'Fuzzy', 'turbopufer', {
            'max_edit_distance': [
                {'min_query_chars': 6, 'distance': 1},
            ],
            "case_sensitive": False
        }),
    )),
    include_attributes=["company_name"],
    limit=10
)
print(result.rows)

This query will prioritize values that contain exactly "turbopufer" as a substring, while simultaneously ensuring that values that contain a substring within 1 edit are returned (since the query has >= 6 characters).

Advanced tuning

The BM25 scoring algorithm involves three parameters that can be tuned for your workload:

  • k1 controls how quickly the impact of term frequency saturates. When k1 is close to zero, term frequency is effectively ignored when scoring a document. When k1 is close to infinity, term frequency contributes nearly linearly to the score.

    The default value, 1.2, means that increasing term frequency in a document boosts heavily to start but quickly results in diminishing returns.

  • b controls document length normalization. When b is 0.0, documents are treated equally regardless of length, which allows long articles to dominate due to sheer volume of terms. When b is 1.0, documents are boosted or penalized based on the ratio of their length to the average document length in the corpus.

    The default value, 0.75, controls for length bias without eliminating it entirely (long documents are often legitimately more relevant).

  • k3 controls the saturation point for query term frequency. When a query contains repeated terms, k3 determines how much additional weight each repetition contributes. When k3 is close to zero, query term repetition is effectively ignored. When k3 is large, repeated query terms contribute nearly linearly to the score.

    The default value, 8.0, allows repeated query terms to have a meaningful impact on scoring while still applying diminishing returns.

The default values are suitable for most applications. Tuning k1 and b is typically required only if your corpus consists of extremely short texts like tweets (decrease k1 and b) or extremely long texts like legal documents (increase k1 and b).
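
The roles of k1, b, and k3 can be made concrete with a small scoring function. This guide does not state turbopuffer's exact formula, so the sketch below uses the textbook Okapi BM25 form with a query-term-frequency component; the IDF variant and the function name are assumptions.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    query_tf=1, k1=1.2, b=0.75, k3=8.0):
    # Rarer terms get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b scales how strongly document length is penalized or rewarded.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # Term frequency saturates at a rate set by k1.
    tf_part = tf * (k1 + 1) / (tf + k1 * length_norm)
    # Repeated query terms saturate at a rate set by k3.
    query_part = query_tf * (k3 + 1) / (query_tf + k3)
    return idf * tf_part * query_part

# Diminishing returns: each extra occurrence adds less than the previous one.
scores = [bm25_term_score(tf, 100, 100, 1000, 10) for tf in (1, 2, 3)]
print(scores[1] - scores[0] > scores[2] - scores[1])  # True

# With b=0.75, a longer document scores lower for the same term frequency.
short_doc = bm25_term_score(2, 100, 100, 1000, 10)
long_doc = bm25_term_score(2, 400, 100, 1000, 10)
print(long_doc < short_doc)  # True
```

Setting b=0 in this sketch makes the two documents score identically, which matches the description of b above.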

To tune these parameters, we recommend an empirical approach: build a set of evals, and choose the parameter values that maximize performance on those evals.
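
A minimal sketch of such an eval loop follows. Everything here is illustrative: recall_at_k and grid_search are hypothetical helpers, and search stands in for whatever function you write to run a query against a namespace configured with the given k1 and b. The fake_search stub exists only so the example runs on its own.

```python
from itertools import product

def recall_at_k(returned_ids, relevant_ids, k):
    # Fraction of the relevant documents found in the top k results.
    hits = len(set(returned_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def grid_search(search, evals, k1_values, b_values, k=10):
    # search(query, k1, b) -> ranked list of ids (hypothetical, supplied by you).
    # evals is a list of (query, relevant_ids) pairs.
    best = None
    for k1, b in product(k1_values, b_values):
        mean_recall = sum(
            recall_at_k(search(query, k1, b), relevant, k)
            for query, relevant in evals
        ) / len(evals)
        if best is None or mean_recall > best[0]:
            best = (mean_recall, k1, b)
    return best

def fake_search(query, k1, b):
    # Stand-in for a real query: pretends a low k1 ranks document 1 first.
    return [1, 2, 3] if k1 < 1.0 else [3, 2, 1]

evals = [("python", [1])]
print(grid_search(fake_search, evals, k1_values=[0.6, 1.2], b_values=[0.5, 0.75], k=1))
# (1.0, 0.6, 0.5)
```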