turbopuffer maintains a schema for each namespace with type and indexing behaviour for each attribute.
The schema can be modified as you upsert documents.
A basic schema will be automatically inferred from the upserted data. You can explicitly configure a schema to specify types that can't be inferred (e.g. UUIDs) or to control indexing behaviour (e.g. enabling full-text search for an attribute). If any parameters are specified for an attribute, the type for that attribute must also be explicitly defined.
Every attribute can have the following fields in its schema specified at upsert time:
type: The data type of the attribute. Supported types:
string: String
int: Signed integer (i64)
uint: Unsigned integer (u64)
uuid: 128-bit UUID
datetime: Date and time
bool: Boolean
[]string: Array of strings
[]int: Array of signed integers
[]uint: Array of unsigned integers
[]uuid: Array of UUIDs
[]datetime: Array of dates and times
All attribute types are nullable by default, except id and vector, which are required. vector will become an optional attribute soon. If you need a namespace without a vector, simply set vector to a random float.
Most types can be inferred from the upsert payload, except uuid, datetime, and their array variants, which all need to be set explicitly in the schema. See UUID values for an example.
By default, integers use a 64-bit signed type (int). To use an unsigned type, set the attribute type to uint explicitly in the schema.
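As an illustration of setting non-inferable types explicitly, the sketch below builds an upsert body with a schema entry for a uuid, a uint, and a datetime attribute. The attribute names are hypothetical, and the overall body layout (an "upserts" list alongside a "schema" object) is an assumption rather than something this page specifies; check the upsert documentation for the exact request shape and any other required fields.

```python
import json
import uuid

# Hypothetical attribute names. Only the "type" values come from the list above.
schema = {
    "permission_id": {"type": "uuid"},     # cannot be inferred, must be explicit
    "view_count": {"type": "uint"},        # would otherwise be inferred as int
    "published_at": {"type": "datetime"},  # cannot be inferred, must be explicit
}

# Assumed request body layout: documents plus a schema object.
payload = {
    "upserts": [
        {
            "id": 1,
            "vector": [0.1, 0.2],
            "attributes": {
                "permission_id": str(uuid.UUID(int=1)),
                "view_count": 42,
                "published_at": "2015-01-20T12:34:56-04:00",
            },
        }
    ],
    "schema": schema,
}

body = json.dumps(payload)  # JSON string that would be sent with the upsert
```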
datetime values should be provided as an ISO 8601-formatted string with a mandatory date and optional time and time zone. Internally, these values are converted to UTC (if the time zone is specified) and stored as a 64-bit integer representing milliseconds since the epoch.
Example: ["2015-01-20", "2015-01-20T12:34:56", "2015-01-20T12:34:56-04:00"]
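To make the stored representation concrete, here is a minimal sketch of the conversion described above, using only the Python standard library. It is not turbopuffer's implementation. The function name is hypothetical, and treating values without a time zone as UTC is an assumption; the text above only says that values with a time zone are converted to UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(value: str) -> int:
    """Convert an ISO 8601 string to integer milliseconds since the epoch."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a value without a time zone is interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)

for text in ["2015-01-20", "2015-01-20T12:34:56", "2015-01-20T12:34:56-04:00"]:
    print(text, to_epoch_millis(text))
```

Under that assumption the three example values map to 1421712000000, 1421757296000, and 1421771696000 respectively; the last one is four hours later than the second because the -04:00 offset is normalised to UTC.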
We'll be adding other data types soon. In the meantime, we suggest representing other data types as either strings or integers.
full_text_search: Whether this attribute can be used as part of a BM25 full-text search. Requires the string or []string type. By default, BM25-enabled attributes are not filterable; you can override this by setting filterable: true.
Can be either a boolean for default settings, or an object with the following optional fields:
language (string): The language of the text. Defaults to english. See: Supported languages
stemming (boolean): Language-specific stemming for the text. Defaults to false (i.e. do not stem).
remove_stopwords (boolean): Removes common words from the text based on language. Defaults to true (i.e. remove common words).
case_sensitive (boolean): Whether searching is case-sensitive. Defaults to false (i.e. case-insensitive).
tokenizer (string): How to convert the text to a list of tokens. Defaults to word_v1. See: Supported tokenizers
If you require other types of full-text search options, please contact us.
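The boolean and object forms can be thought of as the same configuration with different amounts filled in. The sketch below is a hypothetical client-side helper, not part of any turbopuffer SDK, that expands either form into the full set of options using the defaults listed above.

```python
# Defaults as documented above.
FTS_DEFAULTS = {
    "language": "english",
    "stemming": False,
    "remove_stopwords": True,
    "case_sensitive": False,
    "tokenizer": "word_v1",
}

def resolve_full_text_search(setting):
    """Expand a full_text_search setting (bool or dict) into explicit options.

    Returns None when full-text search is disabled.
    """
    if setting is False:
        return None
    if setting is True:
        return dict(FTS_DEFAULTS)
    unknown = set(setting) - set(FTS_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown full_text_search fields: {sorted(unknown)}")
    # Fields given by the caller override the defaults.
    return {**FTS_DEFAULTS, **setting}

print(resolve_full_text_search(True))
print(resolve_full_text_search({"language": "french", "stemming": True}))
```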
Whether the upserted vectors are of type f16 or f32. To use f16 vectors, this field needs to be explicitly specified in the schema when first creating (i.e. upserting to) a namespace.
Example: "vector": {"type": "[512]f16", "ann": true}
New attributes can be added with an upsert. All documents prior to the write will have the attribute set to null.
In most cases, the schema is inferred from the data you upsert. However, as part of an upsert, you can choose to specify the schema for attributes through the parameters above (e.g. to use UUID values or enable BM25 full-text indexing).
We support online, in-place changes of the filterable and full_text_search settings, by setting the schema in an upsert.
Other index settings changes, attribute type changes, and attribute deletions currently cannot be done in-place. Consider exporting documents and upserting into a new namespace if you require a schema change.
After enabling the filterable setting for an attribute, or adding/updating a full-text index, the index needs time to build before queries that depend on the index can be executed. turbopuffer will respond with HTTP status 202 to queries that depend on an index that is not yet built.
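One way a client might handle the 202 response is to retry with a growing delay until the index is ready. The following is a sketch under stated assumptions: send_query is a hypothetical callable you supply that performs the query and returns a (status_code, body) pair, and the retry counts and delays are arbitrary choices, not recommendations from turbopuffer.

```python
import time

def query_when_index_ready(send_query, max_attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call send_query until it stops returning HTTP 202, backing off between tries.

    send_query: hypothetical callable returning (status_code, body).
    sleep: injectable so the wait can be replaced in tests.
    """
    for attempt in range(max_attempts):
        status, body = send_query()
        if status != 202:
            # Index is built (or the request failed for another reason).
            return status, body
        if attempt < max_attempts - 1:
            sleep(delay_seconds * 2 ** attempt)  # exponential backoff
    raise TimeoutError(f"index still building after {max_attempts} attempts")
```

Passing the sleep function as a parameter keeps the helper easy to test and lets callers substitute their own scheduling.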
To retrieve the current schema for a namespace, make a GET request to /v1/namespaces/:namespace/schema.
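A minimal sketch of building that request with the Python standard library follows. Only the path comes from this page; the base URL and the bearer-token Authorization header are assumptions, and the function name is hypothetical.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.turbopuffer.com"  # assumption: not stated on this page

def build_schema_request(namespace, api_key, base_url=BASE_URL):
    """Build (but do not send) a GET request for a namespace's schema."""
    encoded = urllib.parse.quote(namespace, safe="")
    url = f"{base_url}/v1/namespaces/{encoded}/schema"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        method="GET",
    )

request = build_schema_request("my-namespace", "test-key")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; the response body should then contain the schema.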
turbopuffer currently supports language-aware stemming and stopword removal for full-text search. The following languages are supported:
arabic
danish
dutch
english (default)
finnish
french
german
greek
hungarian
italian
norwegian
portuguese
romanian
russian
spanish
swedish
tamil
turkish
Contact us if you need support for other languages.
word_v1 (default)
word_v0
pre_tokenized_array
The word_v1 tokenizer forms tokens from contiguous sequences of alphanumeric codepoints and sequences of emoji codepoints that form a single glyph. Codepoints that are neither alphanumeric nor an emoji are discarded. Codepoints are classified according to v10.0 of the Unicode specification.
The word_v0 tokenizer works like the word_v1 tokenizer, except that emoji codepoints are discarded.
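To give a feel for the word_v0 behaviour, here is a rough approximation in Python. It is not turbopuffer's tokenizer: it relies on str.isalnum, whose classification follows the Unicode version bundled with your Python build rather than v10.0, and it does not apply case folding, stemming, or stopword removal. The function name is hypothetical.

```python
from itertools import groupby

def tokenize_word_v0_like(text):
    """Split text into runs of alphanumeric codepoints, discarding the rest.

    Approximation only: emoji and punctuation are dropped, as word_v0 is
    described to do, but the alphanumeric test is Python's, not Unicode v10.0.
    """
    return [
        "".join(group)
        for is_alnum, group in groupby(text, key=str.isalnum)
        if is_alnum
    ]

print(tokenize_word_v0_like("Hello, world! 42 🙂"))
```

A word_v1-style variant would additionally need to keep emoji sequences that form a single glyph, which requires emoji segmentation data that the standard library does not provide.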
The pre_tokenized_array tokenizer is a special tokenizer that indicates that you want to perform your own tokenization. This tokenizer can only be used on attributes of type []string; each string in the array is interpreted as a token. When this tokenizer is active, queries using the BM25 or ContainsAllTokens operators must supply a query operand of type []string rather than string; each string in the array is interpreted as a token. Tokens are always matched case-sensitively, without stemming or stopword removal. You cannot specify language, stemming: true, remove_stopwords: true, or case_sensitive: false when using this tokenizer.
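The restrictions above can be checked on the client before sending a schema. The helper below is a hypothetical sketch that encodes only the rules stated in this section; the server remains the authority on what is accepted.

```python
def check_pre_tokenized_config(attribute_type, full_text_search):
    """Return a list of problems with a pre_tokenized_array configuration.

    attribute_type: the schema type string, e.g. "[]string".
    full_text_search: the options object for the attribute.
    """
    problems = []
    if full_text_search.get("tokenizer") != "pre_tokenized_array":
        return problems  # rules below apply only to this tokenizer
    if attribute_type != "[]string":
        problems.append("pre_tokenized_array requires type []string")
    if "language" in full_text_search:
        problems.append("language cannot be specified")
    if full_text_search.get("stemming") is True:
        problems.append("stemming: true is not allowed")
    if full_text_search.get("remove_stopwords") is True:
        problems.append("remove_stopwords: true is not allowed")
    if full_text_search.get("case_sensitive") is False:
        problems.append("case_sensitive: false is not allowed")
    return problems

print(check_pre_tokenized_config("[]string", {"tokenizer": "pre_tokenized_array"}))
print(check_pre_tokenized_config("string", {"tokenizer": "pre_tokenized_array", "stemming": True}))
```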
Contact us if you need support for other tokenizers.