Schema

{GET, POST} /v1/namespaces/:namespace/schema

Reads or updates the namespace schema.

turbopuffer maintains a schema for each namespace with type and indexing behaviour for each attribute.

The schema can be modified as you upsert documents.

A basic schema will be automatically inferred from the upserted data. You can explicitly configure a schema to specify types that can't be inferred (e.g. UUIDs) or to control indexing behaviour (e.g. enabling full-text search for an attribute). If any parameters are specified for an attribute, the type for that attribute must also be explicitly defined.
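As a sketch, an upsert request can carry an explicit schema object alongside the documents (the attribute names here are hypothetical, and the surrounding upsert payload is abbreviated):

```json
{
  "schema": {
    "permission_id": {"type": "uuid"},
    "title": {"type": "string", "full_text_search": true}
  }
}
```

Attributes not listed in the schema continue to be inferred automatically.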

Parameters

Each attribute's schema can include the following fields, specified at upsert time:

type string, required

The data type of the attribute. Supported types:

  • string: String
  • int: Signed integer (i64)
  • uint: Unsigned integer (u64)
  • uuid: 128-bit UUID
  • datetime: Date and time
  • bool: Boolean
  • []string: Array of strings
  • []int: Array of signed integers
  • []uint: Array of unsigned integers
  • []uuid: Array of UUIDs
  • []datetime: Array of dates and times

All attribute types are nullable by default, except id and vector which are required. vector will become an optional attribute soon. If you need a namespace without a vector, simply set vector to a random float.

Most types can be inferred from the upsert payload, except uuid, datetime, and their array variants, which all need to be set explicitly in the schema. See UUID values for an example.

By default, integers use a 64-bit signed type (int). To use an unsigned type, set the attribute type to uint explicitly in the schema.

datetime values should be provided as an ISO 8601 formatted string with a mandatory date and optional time and time zone. Internally, these values are converted to UTC (if the time zone is specified) and stored as a 64-bit integer representing milliseconds since the epoch.

Examples: ["2015-01-20", "2015-01-20T12:34:56", "2015-01-20T12:34:56-04:00"]

We'll be adding other data types soon. In the meantime, we suggest representing other data types as either strings or integers.
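To illustrate the types that cannot be inferred, a schema might declare a UUID and a datetime attribute like this (attribute names are hypothetical):

```json
{
  "schema": {
    "user_id": {"type": "uuid"},
    "created_at": {"type": "datetime"}
  }
}
```

Document values for these attributes are upserted as strings; a created_at value such as "2015-01-20T12:34:56-04:00" would be converted to UTC and stored as milliseconds since the epoch.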


filterable boolean, default: true (false if full-text search is enabled)

Whether or not the attribute can be used in filters/WHERE clauses.

Unfiltered attributes don't have an index built for them, and are thus billed at a 50% discount (see pricing).
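For instance, to store an attribute without building a filter index (attribute name hypothetical):

```json
{
  "schema": {
    "raw_payload": {"type": "string", "filterable": false}
  }
}
```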


full_text_search boolean or object, default: false

Whether this attribute can be used as part of a BM25 full-text search. Requires the string or []string type, and by default, BM25-enabled attributes are not filterable. You can override this by setting filterable: true.

Can either be a boolean for default settings, or an object with the following optional fields:

  • language (string): The language of the text. Defaults to english. See: Supported languages
  • stemming (boolean): Language-specific stemming for the text. Defaults to false (i.e. do not stem).
  • remove_stopwords (boolean): Removes common words from the text based on language. Defaults to true (i.e. remove common words).
  • case_sensitive (boolean): Whether searching is case-sensitive. Defaults to false (i.e. case-insensitive).
  • tokenizer (string): How to convert the text to a list of tokens. Defaults to word_v1. See: Supported tokenizers

If you require other types of full-text search options, please contact us.
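A sketch of the object form, combining several of the options above (the attribute name is hypothetical):

```json
{
  "schema": {
    "body": {
      "type": "string",
      "full_text_search": {
        "language": "german",
        "stemming": true,
        "remove_stopwords": true,
        "case_sensitive": false,
        "tokenizer": "word_v1"
      }
    }
  }
}
```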


vector object, default: {'type': [dims]f32, 'ann': true}

Whether the upserted vectors are of type f16 or f32.

To use f16 vectors, this field needs to be explicitly specified in the schema when first creating (i.e. upserting to) a namespace.

Example: "vector": {"type": [512]f16, "ann": true}
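If your client serializes the schema as JSON, the type would typically be written as a string (a sketch, assuming 512-dimensional vectors; check your client library for the exact encoding):

```json
{
  "schema": {
    "vector": {"type": "[512]f16", "ann": true}
  }
}
```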

Adding new attributes

New attributes can be added with an upsert. All documents prior to the write will have the attribute set to null.

In most cases, the schema is inferred from the data you upsert. However, as part of an upsert, you can choose to specify the schema for attributes through the parameters above (e.g. to use UUID values or enable BM25 full-text indexing).
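For example, a later upsert could introduce a new attribute by declaring it in the schema; all documents written before that upsert would have it set to null (attribute name hypothetical):

```json
{
  "schema": {
    "view_count": {"type": "uint"}
  }
}
```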

Changing existing attributes

We support online, in-place changes of the filterable and full_text_search settings, by setting the schema in an upsert.

Other index settings changes, attribute type changes, and attribute deletions currently cannot be done in-place. Consider exporting documents and upserting into a new namespace if you require a schema change.

After enabling the filterable setting for an attribute, or adding/updating a full-text index, the index needs time to build before queries that depend on the index can be executed. turbopuffer will respond with HTTP status 202 to queries that depend on an index that is not yet built.
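For example, to make a previously unfilterable attribute filterable, include it in the schema of any upsert; until the index finishes building, queries that filter on it may receive HTTP 202 (attribute name hypothetical):

```json
{
  "schema": {
    "raw_payload": {"type": "string", "filterable": true}
  }
}
```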

Inspect

To retrieve the current schema for a namespace, make a GET request to /v1/namespaces/:namespace/schema.
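The response maps each attribute name to its schema. A sketch of what a response might look like (attribute names and the exact field set are illustrative):

```json
{
  "id": {"type": "uint", "filterable": true},
  "title": {"type": "string", "filterable": false, "full_text_search": true},
  "vector": {"type": "[512]f32", "ann": true}
}
```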

Supported languages

turbopuffer currently supports language-aware stemming and stopword removal for full-text search. The following languages are supported:

  • arabic
  • danish
  • dutch
  • english (default)
  • finnish
  • french
  • german
  • greek
  • hungarian
  • italian
  • norwegian
  • portuguese
  • romanian
  • russian
  • spanish
  • swedish
  • tamil
  • turkish

Other languages can be supported by contacting us.

Supported tokenizers

  • word_v1 (default)
  • word_v0
  • pre_tokenized_array

The word_v1 tokenizer forms tokens from contiguous sequences of alphanumeric codepoints and sequences of emoji codepoints that form a single glyph. Codepoints that are neither alphanumeric nor an emoji are discarded. Codepoints are classified according to v10.0 of the Unicode specification.

The word_v0 tokenizer works like the word_v1 tokenizer, except that emoji codepoints are discarded.

The pre_tokenized_array tokenizer is a special tokenizer that indicates you want to perform your own tokenization. It can only be used on attributes of type []string; each string in the array is interpreted as a token. When this tokenizer is active, queries using the BM25 or ContainsAllTokens operators must supply a query operand of type []string rather than string, which is likewise interpreted as a list of tokens. Tokens are always matched case-sensitively, without stemming or stopword removal, so you cannot specify language, stemming: true, remove_stopwords: true, or case_sensitive: false when using this tokenizer.
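As a sketch, an attribute using this tokenizer might be declared like this (attribute name hypothetical; the exact query syntax is documented with the query endpoint):

```json
{
  "schema": {
    "tags": {
      "type": "[]string",
      "full_text_search": {"tokenizer": "pre_tokenized_array"}
    }
  }
}
```

A BM25 rank or ContainsAllTokens filter against tags would then pass a token array such as ["rust", "database"] rather than a single query string.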

Other tokenizers can be supported by contacting us.