Reads or updates the namespace schema.
turbopuffer maintains a schema for each namespace with type and indexing behaviour for each attribute.
The schema can be modified as you write documents.
A basic schema will be automatically inferred from the upserted data. You can explicitly configure a schema to specify types that can't be inferred (e.g. UUIDs) or to control indexing behaviour (e.g. enabling full-text search for an attribute). If any parameters are specified for an attribute, the type for that attribute must also be explicitly defined.
Every attribute can have the following fields in its schema specified at write time:
The data type of the attribute. Supported types:

- `string`: String
- `int`: Signed integer (i64)
- `uint`: Unsigned integer (u64)
- `uuid`: 128-bit UUID
- `datetime`: Date and time
- `bool`: Boolean
- `[]string`: Array of strings
- `[]int`: Array of signed integers
- `[]uint`: Array of unsigned integers
- `[]uuid`: Array of UUIDs
- `[]datetime`: Array of dates and times

All attribute types are nullable by default, except `id` and `vector`, which are required. `vector` will become an optional attribute soon. If you need a namespace without a vector, simply set `vector` to a random float.
Most types can be inferred from the write payload, except `uuid`, `datetime`, and their array variants, which all need to be set explicitly in the schema. See UUID values for an example.
By default, integers use a 64-bit signed type (`int`). To use an unsigned type, set the attribute type to `uint` explicitly in the schema.
`datetime` values should be provided as an ISO 8601 formatted string with a mandatory date and optional time and time zone. Internally, these values are converted to UTC (if the time zone is specified) and stored as a 64-bit integer representing milliseconds since the epoch.

Examples: `"2015-01-20"`, `"2015-01-20T12:34:56"`, `"2015-01-20T12:34:56-04:00"`
We'll be adding other data types soon. In the meantime, we suggest representing other data types as either strings or integers.
Whether or not the attribute can be used in filters/WHERE clauses. Filtered attributes are indexed into an inverted index. At query time, filter evaluation is recall-aware for vector queries.

Unfiltered attributes don't have an index built for them, and are thus billed at a 50% discount (see pricing).
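For example, an attribute that is never used in filters can opt out of indexing (a sketch; the attribute name is hypothetical):

```python
# Unfiltered attribute: no inverted index is built, billed at a 50% discount.
schema = {
    "raw_payload": {"type": "string", "filterable": False},
}
```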
Whether this attribute can be used as part of a BM25 full-text search. Requires the `string` or `[]string` type. By default, BM25-enabled attributes are not filterable; you can override this by setting `filterable: true`.
Can either be a boolean for default settings, or an object with the following optional fields:

- `language` (string): The language of the text. Defaults to `english`. See: Supported languages.
- `stemming` (boolean): Language-specific stemming for the text. Defaults to `false` (i.e. do not stem).
- `remove_stopwords` (boolean): Removes common words from the text based on `language`. Defaults to `true` (i.e. remove common words).
- `case_sensitive` (boolean): Whether searching is case-sensitive. Defaults to `false` (i.e. case-insensitive).
- `tokenizer` (string): How to convert the text to a list of tokens. Defaults to `word_v1`. See: Supported tokenizers.
- `k1` (float): Term frequency saturation parameter for BM25 scoring. Must be greater than zero. Defaults to `1.2`. See: Advanced tuning.
- `b` (float): Document length normalization parameter for BM25 scoring. Must be in the range `[0.0, 1.0]`. Defaults to `0.75`. See: Advanced tuning.

If you require other types of full-text search options, please contact us.
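Putting these options together, a full-text-search entry in the schema might look like this sketch (the attribute name is hypothetical; the option values are the documented defaults, spelled out explicitly):

```python
schema = {
    "content": {
        "type": "string",
        "full_text_search": {
            "language": "english",     # see: Supported languages
            "stemming": False,         # default: do not stem
            "remove_stopwords": True,  # default: drop common words
            "case_sensitive": False,   # default: case-insensitive
            "tokenizer": "word_v1",    # see: Supported tokenizers
            "k1": 1.2,                 # see: Advanced tuning
            "b": 0.75,                 # see: Advanced tuning
        },
    },
}
```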
Whether the upserted vectors are of type `f16` or `f32`.

To use `f16` vectors, this field needs to be explicitly specified in the schema when first creating (i.e. writing to) a namespace.

Example: `"vector": {"type": "[512]f16", "ann": true}`
New attributes can be added with a write or an explicit schema update. All documents prior to the schema update will have the attribute set to `null`.
In most cases, the schema is inferred from the data you write. However, as part of a write, you can choose to specify the schema for attributes using the parameters above (e.g. to use UUID values or enable BM25 full-text indexing).
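As a rough sketch of a write that carries an explicit schema: the write endpoint, upsert payload shape, base URL, and auth header below are assumptions for illustration; only the `schema` contents mirror the parameters described on this page.

```python
import os
import requests

# Assumed write endpoint and payload shape; only "schema" mirrors this page.
resp = requests.post(
    "https://api.turbopuffer.com/v1/namespaces/my-namespace",  # hypothetical write path
    headers={"Authorization": f"Bearer {os.environ['TURBOPUFFER_API_KEY']}"},
    json={
        "upserts": [
            {
                "id": 1,
                "vector": [0.1, 0.2],
                "attributes": {"user_id": "123e4567-e89b-12d3-a456-426614174000"},
            }
        ],
        "schema": {"user_id": {"type": "uuid"}},  # explicit: uuid can't be inferred
    },
)
resp.raise_for_status()
```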
We support online, in-place changes of the `filterable` and `full_text_search` settings, by setting the schema in a write or by sending an explicit schema update.
Other index settings changes, attribute type changes, and attribute deletions currently cannot be done in-place. Consider exporting documents and upserting into a new namespace if you require a schema change.
After enabling the `filterable` setting for an attribute, or adding/updating a full-text index, the index needs time to build before queries that depend on it can be executed. turbopuffer will respond with HTTP status 202 to queries that depend on an index that is not yet built.
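A client can treat that 202 as "index still building" and retry with backoff. A minimal sketch (the query URL and payload are placeholders):

```python
import time
import requests

def query_when_ready(url, headers, payload, max_wait=60.0):
    """POST a query, retrying while the index is still building (HTTP 202)."""
    deadline = time.monotonic() + max_wait
    delay = 0.5
    while True:
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code != 202:  # anything but 202: success or a real error
            resp.raise_for_status()
            return resp.json()
        if time.monotonic() >= deadline:
            raise TimeoutError("index still building after max_wait seconds")
        time.sleep(delay)
        delay = min(delay * 2, 5.0)  # exponential backoff, capped at 5s
```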
To retrieve the current schema for a namespace, make a `GET` request to `/v1/namespaces/:namespace/schema`.
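For example (a sketch; the base URL and auth header are assumptions, while the path is the one documented above):

```python
import os
import requests

resp = requests.get(
    "https://api.turbopuffer.com/v1/namespaces/my-namespace/schema",
    headers={"Authorization": f"Bearer {os.environ['TURBOPUFFER_API_KEY']}"},
)
resp.raise_for_status()
print(resp.json())  # maps attribute name -> schema (type, filterable, ...)
```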
To update the schema for a namespace without a write, make a `POST` request to `/v1/namespaces/:namespace/schema`.

For example, to change an attribute called `my-text` to unfilterable:
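A sketch of that request (the body shape here, mapping the attribute name to its updated settings, is an assumption; base URL and auth header as in the examples above):

```python
import os
import requests

# Assumed body shape: attribute name -> updated settings.
resp = requests.post(
    "https://api.turbopuffer.com/v1/namespaces/my-namespace/schema",
    headers={"Authorization": f"Bearer {os.environ['TURBOPUFFER_API_KEY']}"},
    json={"my-text": {"type": "string", "filterable": False}},
)
resp.raise_for_status()
```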
turbopuffer currently supports language-aware stemming and stopword removal for full-text search. The following languages are supported:
- `arabic`
- `danish`
- `dutch`
- `english` (default)
- `finnish`
- `french`
- `german`
- `greek`
- `hungarian`
- `italian`
- `norwegian`
- `portuguese`
- `romanian`
- `russian`
- `spanish`
- `swedish`
- `tamil`
- `turkish`

Other languages can be supported by contacting us.
- `word_v2`
- `word_v1` (default)
- `word_v0`
- `pre_tokenized_array`
The `word_v2` tokenizer forms tokens from ideographic codepoints, contiguous sequences of alphanumeric codepoints, and sequences of emoji codepoints that form a single glyph. Codepoints that are not alphanumeric, ideographic, or an emoji are discarded. Codepoints are classified according to Unicode v16.0.
The `word_v1` tokenizer works like the `word_v2` tokenizer, except that ideographic codepoints are treated as alphanumeric codepoints. Codepoints are classified according to Unicode v10.0.
The `word_v0` tokenizer works like the `word_v1` tokenizer, except that emoji codepoints are discarded.
The `pre_tokenized_array` tokenizer is a special tokenizer that indicates that you want to perform your own tokenization. This tokenizer can only be used on attributes of type `[]string`; each string in the array is interpreted as a token. When this tokenizer is active, queries using the `BM25` or `ContainsAllTokens` operators must supply a query operand of type `[]string` rather than `string`; each string in the array is interpreted as a token. Tokens are always matched case-sensitively, without stemming or stopword removal. You cannot specify `language`, `stemming: true`, `remove_stopwords: true`, or `case_sensitive: false` when using this tokenizer.
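A sketch of the contract this tokenizer implies: write-side values and query operands are both token arrays. The attribute name and variable names below are hypothetical, for illustration only.

```python
# Write side: the []string attribute value is the token list itself.
attributes = {"content_tokens": ["Hello", "world"]}

# Query side: BM25 / ContainsAllTokens operands must be []string too.
# Matching is case-sensitive: "hello" will NOT match the "Hello" token above.
bm25_operand = ["Hello", "world"]
contains_all_operand = ["world"]
```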
Other tokenizers can be supported by contacting us.
The BM25 scoring algorithm involves two parameters that can be tuned for your workload:
`k1` controls how quickly the impact of term frequency saturates. When `k1` is close to zero, term frequency is effectively ignored when scoring a document. When `k1` is close to infinity, term frequency contributes nearly linearly to the score.

The default value, `1.2`, means that increasing term frequency in a document boosts the score heavily at first but quickly yields diminishing returns.
`b` controls document length normalization. When `b` is `0.0`, documents are treated equally regardless of length, which lets long articles dominate due to sheer volume of terms. When `b` is `1.0`, documents are boosted or penalized based on the ratio of their length to the average document length in the corpus.

The default value, `0.75`, controls for length bias without eliminating it entirely (long documents are often legitimately more relevant).
The default values are suitable for most applications. Tuning is typically required only if your corpus consists of extremely short texts like tweets (decrease `k1` and `b`) or extremely long texts like legal documents (increase `k1` and `b`).
To tune these parameters, we recommend an empirical approach: build a set of evals, and choose the parameter values that maximize performance on those evals.
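One way to structure that search, as a rough sketch: `evaluate` is a hypothetical harness that runs your eval queries against an index built with the given parameters and returns a quality metric such as NDCG.

```python
import itertools

def grid_search(evaluate):
    """Pick the (k1, b) pair that maximizes the eval metric."""
    k1_grid = [0.6, 0.9, 1.2, 1.5, 2.0]  # values around the 1.2 default
    b_grid = [0.25, 0.5, 0.75, 1.0]      # values around the 0.75 default
    best_score, best_params = float("-inf"), None
    for k1, b in itertools.product(k1_grid, b_grid):
        score = evaluate(k1=k1, b=b)  # hypothetical: runs evals with these params
        if score > best_score:
            best_score, best_params = score, (k1, b)
    return best_params, best_score
```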