Reads or updates the namespace schema.
turbopuffer maintains a schema for each namespace with type and indexing behaviour for each attribute.
The schema can be modified as you write documents.
A basic schema will be automatically inferred from the upserted data. You can explicitly configure a schema to specify types that can't be inferred (e.g. UUIDs) or to control indexing behaviour (e.g. enabling full-text search for an attribute). If any parameters are specified for an attribute, the type for that attribute must also be explicitly defined.
Every attribute can have the following fields in its schema specified at write time:
The data type of the attribute. Supported types:

- `string`: String
- `int`: Signed integer (i64)
- `uint`: Unsigned integer (u64)
- `uuid`: 128-bit UUID
- `datetime`: Date and time
- `bool`: Boolean
- `[]string`: Array of strings
- `[]int`: Array of signed integers
- `[]uint`: Array of unsigned integers
- `[]uuid`: Array of UUIDs
- `[]datetime`: Array of dates and times

All attribute types are nullable by default, except `id` and `vector`, which are required. `vector` will become an optional attribute soon. If you need a namespace without a vector, simply set `vector` to a random float.
Most types can be inferred from the write payload, except `uuid`, `datetime`, and their array variants, which all need to be set explicitly in the schema. See UUID values for an example.
By default, integers use a 64-bit signed type (`int`). To use an unsigned type, set the attribute type to `uint` explicitly in the schema.
`datetime` values should be provided as an ISO 8601 formatted string with a mandatory date and optional time and time zone. Internally, these values are converted to UTC (if the time zone is specified) and stored as a 64-bit integer representing milliseconds since the epoch.

Examples: `"2015-01-20"`, `"2015-01-20T12:34:56"`, `"2015-01-20T12:34:56-04:00"`
We'll be adding other data types soon. In the meantime, we suggest representing other data types as either strings or integers.
Whether or not the attribute can be used in filters/WHERE clauses. Filtered attributes are indexed into an inverted index. At query time, filter evaluation is recall-aware for vector queries.

Unfiltered attributes don't have an index built for them, and are thus billed at a 50% discount (see pricing).
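For example, an attribute that is never used in filters can opt out of indexing (a sketch; the attribute name is hypothetical):

```python
# Unfiltered attribute: no inverted index is built, billed at a 50% discount.
schema = {
    "raw_payload": {"type": "string", "filterable": False},
}
```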
Whether this attribute can be used as part of a BM25 full-text search. Requires the `string` or `[]string` type. By default, BM25-enabled attributes are not filterable; you can override this by setting `filterable: true`.
Can either be a boolean for default settings, or an object with the following optional fields:

- `language` (string): The language of the text. Defaults to `english`. See: Supported languages.
- `stemming` (boolean): Language-specific stemming for the text. Defaults to `false` (i.e. do not stem).
- `remove_stopwords` (boolean): Removes common words from the text based on `language`. Defaults to `true` (i.e. remove common words).
- `case_sensitive` (boolean): Whether searching is case-sensitive. Defaults to `false` (i.e. case-insensitive).
- `tokenizer` (string): How to convert the text to a list of tokens. Defaults to `word_v1`. See: Supported tokenizers.
- `k1` (float): Term frequency saturation parameter for BM25 scoring. Must be greater than zero. Defaults to `1.2`. See: Advanced tuning.
- `b` (float): Document length normalization parameter for BM25 scoring. Must be in the range `[0.0, 1.0]`. Defaults to `0.75`. See: Advanced tuning.

If you require other types of full-text search options, please contact us.
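Putting these options together, a full-text-search entry in the schema might look like this sketch (the attribute name is hypothetical; the option values are the documented defaults, spelled out explicitly):

```python
schema = {
    "content": {
        "type": "string",
        "full_text_search": {
            "language": "english",     # see: Supported languages
            "stemming": False,         # default: do not stem
            "remove_stopwords": True,  # default: drop common words
            "case_sensitive": False,   # default: case-insensitive
            "tokenizer": "word_v1",    # see: Supported tokenizers
            "k1": 1.2,                 # see: Advanced tuning
            "b": 0.75,                 # see: Advanced tuning
        },
    },
}
```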
Whether the upserted vectors are of type `f16` or `f32`.

To use `f16` vectors, this field needs to be explicitly specified in the schema when first creating (i.e. writing to) a namespace.

Example: `"vector": {"type": "[512]f16", "ann": true}`
New attributes can be added with a write or an explicit schema update. All documents prior to the schema update will have the attribute set to `null`.
In most cases, the schema is inferred from the data you write. However, as part of a write, you can choose to specify the schema for attributes using the parameters above (e.g. to use UUID values or enable BM25 full-text indexing).
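As a rough sketch of a write that carries an explicit schema: the write endpoint, upsert payload shape, base URL, and auth header below are assumptions for illustration; only the `schema` contents mirror the parameters described on this page.

```python
import os
import requests

# Assumed write endpoint and payload shape; only "schema" mirrors this page.
resp = requests.post(
    "https://api.turbopuffer.com/v1/namespaces/my-namespace",  # hypothetical write path
    headers={"Authorization": f"Bearer {os.environ['TURBOPUFFER_API_KEY']}"},
    json={
        "upserts": [
            {
                "id": 1,
                "vector": [0.1, 0.2],
                "attributes": {"user_id": "123e4567-e89b-12d3-a456-426614174000"},
            }
        ],
        "schema": {"user_id": {"type": "uuid"}},  # explicit: uuid can't be inferred
    },
)
resp.raise_for_status()
```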
We support online, in-place changes of the `filterable` and `full_text_search` settings, by setting the schema in a write or by sending an explicit schema update.
Other index settings changes, attribute type changes, and attribute deletions currently cannot be done in-place. Consider exporting documents and upserting into a new namespace if you require a schema change.
After enabling the `filterable` setting for an attribute, or adding/updating a full-text index, the index needs time to build before queries that depend on it can be executed. turbopuffer will respond with HTTP status 202 to queries that depend on an index that is not yet built.
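A client can treat that 202 as "index still building" and retry with backoff. A minimal sketch (the query URL and payload are placeholders):

```python
import time
import requests

def query_when_ready(url, headers, payload, max_wait=60.0):
    """POST a query, retrying while the index is still building (HTTP 202)."""
    deadline = time.monotonic() + max_wait
    delay = 0.5
    while True:
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code != 202:  # anything but 202: success or a real error
            resp.raise_for_status()
            return resp.json()
        if time.monotonic() >= deadline:
            raise TimeoutError("index still building after max_wait seconds")
        time.sleep(delay)
        delay = min(delay * 2, 5.0)  # exponential backoff, capped at 5s
```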
To retrieve the current schema for a namespace, make a `GET` request to `/v1/namespaces/:namespace/schema`.
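For example (a sketch; the base URL and auth header are assumptions, while the path is the one documented above):

```python
import os
import requests

resp = requests.get(
    "https://api.turbopuffer.com/v1/namespaces/my-namespace/schema",
    headers={"Authorization": f"Bearer {os.environ['TURBOPUFFER_API_KEY']}"},
)
resp.raise_for_status()
print(resp.json())  # maps attribute name -> schema (type, filterable, ...)
```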
To update the schema for a namespace without a write, make a `POST` request to `/v1/namespaces/:namespace/schema`.

For example, to change an attribute called `my-text` to unfilterable:
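A sketch of that request (the body shape here, mapping the attribute name to its updated settings, is an assumption; base URL and auth header as in the examples above):

```python
import os
import requests

# Assumed body shape: attribute name -> updated settings.
resp = requests.post(
    "https://api.turbopuffer.com/v1/namespaces/my-namespace/schema",
    headers={"Authorization": f"Bearer {os.environ['TURBOPUFFER_API_KEY']}"},
    json={"my-text": {"type": "string", "filterable": False}},
)
resp.raise_for_status()
```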
turbopuffer currently supports language-aware stemming and stopword removal for full-text search. The following languages are supported:
- `arabic`
- `danish`
- `dutch`
- `english` (default)
- `finnish`
- `french`
- `german`
- `greek`
- `hungarian`
- `italian`
- `norwegian`
- `portuguese`
- `romanian`
- `russian`
- `spanish`
- `swedish`
- `tamil`
- `turkish`

Other languages can be supported by contacting us.
- `word_v2`
- `word_v1` (default)
- `word_v0`
- `pre_tokenized_array`
The `word_v2` tokenizer forms tokens from ideographic codepoints, contiguous sequences of alphanumeric codepoints, and sequences of emoji codepoints that form a single glyph. Codepoints that are not alphanumeric, ideographic, or an emoji are discarded. Codepoints are classified according to Unicode v16.0.
The `word_v1` tokenizer works like the `word_v2` tokenizer, except that ideographic codepoints are treated as alphanumeric codepoints. Codepoints are classified according to Unicode v10.0.
The `word_v0` tokenizer works like the `word_v1` tokenizer, except that emoji codepoints are discarded.
The `pre_tokenized_array` tokenizer is a special tokenizer that indicates that you want to perform your own tokenization. This tokenizer can only be used on attributes of type `[]string`; each string in the array is interpreted as a token. When this tokenizer is active, queries using the `BM25` or `ContainsAllTokens` operators must supply a query operand of type `[]string` rather than `string`; each string in the array is interpreted as a token. Tokens are always matched case-sensitively, without stemming or stopword removal. You cannot specify `language`, `stemming: true`, `remove_stopwords: true`, or `case_sensitive: false` when using this tokenizer.
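A sketch of the contract this tokenizer implies: write-side values and query operands are both token arrays. The attribute name and variable names below are hypothetical, for illustration only.

```python
# Write side: the []string attribute value is the token list itself.
attributes = {"content_tokens": ["Hello", "world"]}

# Query side: BM25 / ContainsAllTokens operands must be []string too.
# Matching is case-sensitive: "hello" will NOT match the "Hello" token above.
bm25_operand = ["Hello", "world"]
contains_all_operand = ["world"]
```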
Other tokenizers can be supported by contacting us.
The BM25 scoring algorithm involves two parameters that can be tuned for your workload:
`k1` controls how quickly the impact of term frequency saturates. When `k1` is close to zero, term frequency is effectively ignored when scoring a document. When `k1` is close to infinity, term frequency contributes nearly linearly to the score.

The default value, `1.2`, means that increasing term frequency in a document boosts the score heavily at first but quickly yields diminishing returns.
`b` controls document length normalization. When `b` is `0.0`, documents are treated equally regardless of length, which lets long articles dominate due to sheer volume of terms. When `b` is `1.0`, documents are boosted or penalized based on the ratio of their length to the average document length in the corpus.

The default value, `0.75`, controls for length bias without eliminating it entirely (long documents are often legitimately more relevant).
The default values are suitable for most applications. Tuning is typically required only if your corpus consists of extremely short texts like tweets (decrease `k1` and `b`) or extremely long texts like legal documents (increase `k1` and `b`).
To tune these parameters, we recommend an empirical approach: build a set of evals, and choose the parameter values that maximize performance on those evals.
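One way to structure that search, as a rough sketch: `evaluate` is a hypothetical harness that runs your eval queries against an index built with the given parameters and returns a quality metric such as NDCG.

```python
import itertools

def grid_search(evaluate):
    """Pick the (k1, b) pair that maximizes the eval metric."""
    k1_grid = [0.6, 0.9, 1.2, 1.5, 2.0]  # values around the 1.2 default
    b_grid = [0.25, 0.5, 0.75, 1.0]      # values around the 0.75 default
    best_score, best_params = float("-inf"), None
    for k1, b in itertools.product(k1_grid, b_grid):
        score = evaluate(k1=k1, b=b)  # hypothetical: runs evals with these params
        if score > best_score:
            best_score, best_params = score, (k1, b)
    return best_params, best_score
```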