
POST /v2/namespaces/:namespace

Creates, updates, or deletes documents.

Latency

Upsert latency (500 KB docs):

  Percentile   Latency
  p50          285ms
  p90          370ms
  p99          688ms

A :namespace is an isolated set of documents and is implicitly created when the first document is inserted. Namespace names must match [A-Za-z0-9-_.]{1,128}.

We recommend creating a namespace per isolated document space instead of filtering when possible. Large batches of writes are highly encouraged to maximize throughput and minimize cost. Write requests can have a payload size of up to 256 MB. See Performance.

Within a namespace, documents are uniquely referred to by their ID. Document IDs are unsigned 64-bit integers, 128-bit UUIDs, or strings.

turbopuffer supports the following types of writes:

Request

upsert_rows array

Upserts documents in a row-based format. Each row is an object with an id field (the document ID) and any number of other attribute fields.

A namespace may or may not have a vector index. If it does, all documents must include a vector field. Otherwise, the vector key should be omitted.

Example: [{"id": 1, "vector": [1, 2, 3], "name": "foo"}, {"id": 2, "vector": [4, 5, 6], "name": "bar"}]


upsert_columns object

Upserts documents in a column-based format. This field is an object, where each key is the name of a column, and each value is an array of values for that column.

The id key is required, and must contain an array of document IDs. The vector key is required if the namespace has a vector index. Other keys will be stored as attributes.

Each column must be the same length. When a document doesn't have a value for a given column, pass null.

Example: {"id": [1, 2], "vector": [[1, 2, 3], [4, 5, 6]], "name": ["foo", "bar"]}


patch_rows array

Patches documents in a row-based format. Identical to upsert_rows, but instead of overwriting entire documents, only the specified keys are written.

The vector key currently cannot be patched; to change a vector, you need to retrieve and upsert the entire document.

Any patches to IDs that don't already exist in the namespace will be ignored; patches will not create any missing documents.

Example: [{"id": 1, "name": "baz"}, {"id": 2, "name": "qux"}]

Patches are billed for the size of the patched attributes (not the full written documents), plus the cost of one query per write request (to read all the patched documents touched by the request).


patch_columns object

Patches documents in a column-based format. Identical to upsert_columns, but instead of overwriting entire documents, only the specified keys are written.

The vector key currently cannot be patched; to change a vector, you need to retrieve and upsert the entire document.

Any patches to IDs that don't already exist in the namespace will be ignored; patches will not create any missing documents.

Example: {"id": [1, 2], "name": ["baz", "qux"]}


deletes array

Deletes documents by ID. Must be an array of document IDs.

Example: [1, 2, 3]


upsert_condition object

Makes each write in upsert_rows and upsert_columns conditional on the upsert_condition being satisfied for the document with the corresponding ID.

The upsert_condition is evaluated before each write, using the current value of the document with the matching ID.

  • If the document exists and the condition is met, the write is applied (i.e. the document is updated).
  • If the document exists and the condition is not met, the write is skipped.
  • If the document does not exist, the write is applied unconditionally (i.e. the document is created).

The condition syntax matches the filters parameter in the query API, with an additional feature: you can reference the new value being written using $ref_new references. These look like {"$ref_new": "attr_123"} and can be used in place of value literals.

Example: ["Or", [["updated_at", "Lt", {"$ref_new": "updated_at"}], ["updated_at", "Eq", null]]]

This condition ensures that each upsert is only processed if the new document value has a newer "updated_at" timestamp than its current version.


patch_condition object

Like upsert_condition, but for patch_rows and patch_columns.

Any patches to IDs that don't already exist in the namespace will be ignored without evaluating the condition; patches will not create any missing documents.
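
An illustrative condition (with a hypothetical version attribute) that applies each patch only when the incoming document carries a higher version than the stored one:

Example: ["version", "Lt", {"$ref_new": "version"}]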


delete_condition object

Like upsert_condition, but for deletes.

Because a delete writes no new document values, $ref_new references evaluate to null for all attributes.

Does not apply to delete_by_filter. Prefer this over delete_by_filter when the set of IDs to conditionally delete is known ahead of time.
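
A sketch of a request body (with a hypothetical archived attribute) that deletes documents 1 and 2 only if they are currently marked as archived:

Example: {"deletes": [1, 2], "delete_condition": ["archived", "Eq", true]}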


delete_by_filter object

You can delete documents that match a filter using delete_by_filter. It has the same syntax as the filters parameter in the query API.

If delete_by_filter is used in the same request as other write operations, delete_by_filter will be applied before the other operations. This allows you to delete rows that match a filter before writing new rows with overlapping IDs. Note that patches to any deleted rows are ignored.

delete_by_filter is different from deletes with a delete_condition:

  • delete_by_filter: searches across the namespace for any matching document IDs, deleting all matches that it finds.
  • deletes + delete_condition: only evaluates the condition on the IDs identified in deletes.

delete_condition does not apply to delete_by_filter.

Example: ["page_id", "Eq", 123]

delete_by_filter is billed the same as normal deletes, plus the cost of one query per write request (to determine which IDs to delete).


distance_metric cosine_distance | euclidean_squared (required unless copy_from_namespace is set or no vector is set)

The function used to calculate vector similarity. Possible values are cosine_distance or euclidean_squared.

cosine_distance is defined as 1 - cosine_similarity and ranges from 0 to 2. Lower is better.

euclidean_squared is defined as sum((x - y)^2). Lower is better.


copy_from_namespace string

Copy all documents from another namespace into this one. The destination namespace you are copying into must be empty. This operation is currently limited to copying within the same region and organization. The initial request currently cannot make schema changes or contain documents.

Copying is billed at a 50% write discount, which stacks with the batched-write discount of up to 50%. This is a faster, cheaper alternative to re-upserting documents, useful for backups and for namespaces that share documents.

Example: "source-namespace"


schema object

By default, the schema is inferred from the passed data. See Schema below.

There are cases where you need to specify the schema manually because turbopuffer can't infer it automatically: for example, to specify UUID types, configure full-text search for an attribute, or disable filtering for an attribute.

Example: {"permissions": "[]uuid", "text": {"type": "string", "full_text_search": true}, "encrypted_blob": {"type": "string", "filterable": false}}


encryption object (optional)

Only available as part of our scale and enterprise plans.

Setting a Customer Managed Encryption Key (CMEK) will encrypt all data in a namespace using a secret coming from your cloud KMS. Once set, all subsequent writes to this namespace will be encrypted, but data written prior to this upsert will be unaffected.

Currently, turbopuffer does not re-encrypt data when you rotate key versions, meaning old data will remain encrypted using older key versions, while fresh writes will be encrypted using the latest version. Revoking old key versions will cause data loss. To re-encrypt your data using a more recent key, use the export API to re-upsert into a new namespace.

Example (GCP): { "cmek": { "key_name": "projects/myproject/locations/us-central1/keyRings/EXAMPLE/cryptoKeys/KEYNAME" } }

Example (AWS): { "cmek": { "key_name": "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012" } }

Response

rows_affected number

The total number of rows affected by the write request (sum of upserted, patched, and deleted rows).

rows_upserted number

The number of rows upserted by the write request. Only present when upsert_rows or upsert_columns is used.

rows_patched number

The number of rows patched by the write request. Only present when patch_rows or patch_columns is used.

When using patch_condition, this reflects only the rows where the condition was met and the patch was applied; patches whose condition was not met are skipped.

rows_deleted number

The number of rows deleted by the write request. Only present when deletes or delete_by_filter is used.

When using delete_condition, this reflects only the rows where the condition was met and the deletion occurred; deletes whose condition was not met are skipped.

rows_remaining boolean

delete_by_filter is presently limited to deleting a maximum of 5M documents per write request. This ensures indexing and consistent reads can keep up with deletes. If this response field is set to true, more documents match the delete_by_filter; issue the same request again to delete the remaining matches.

billing object

The billable resources consumed by the write. The object contains the following fields:

  • billable_logical_bytes_written (uint): the number of logical bytes written to the namespace
  • query (object, optional): query billing information when the write involves a query (for a conditional write or delete_by_filter):
    • billable_logical_bytes_queried (uint): the number of logical bytes processed by queries
    • billable_logical_bytes_returned (uint): the number of logical bytes returned by queries
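
As an illustration, a response to a conditional upsert of two rows might look like the following (all numbers are made up):

  {
    "rows_affected": 2,
    "rows_upserted": 2,
    "billing": {
      "billable_logical_bytes_written": 1048576,
      "query": {
        "billable_logical_bytes_queried": 2097152,
        "billable_logical_bytes_returned": 1024
      }
    }
  }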

Attributes

Documents are composed of attributes. All documents must have a unique id attribute. Attribute names can be up to 128 characters in length and must not start with a $ character.

By default, attributes are indexed and thus queries can filter or sort by them. To disable indexing for an attribute, set filterable to false in the schema for a 50% discount and improved indexing performance. The attribute can still be returned, but not used for filtering or sorting.

Attributes must have consistent value types, and are nullable. The type is inferred from the first occurrence of the attribute. Certain non-inferrable types, e.g. uuid or datetime, must be specified in the schema.

Some limits apply to attribute sizes and number of attribute names per namespace. See Limits.

Vectors

Vectors are attributes with the name vector, encoded as either a JSON array of f16 or f32 values, or as a base64-encoded string. To use f16, the vector field must be explicitly specified in the schema when first creating the namespace.

Each vector in the namespace must have the same number of dimensions.

If using the base64 encoding, the vector must be serialized in little-endian float32 or float16 binary format, then base64-encoded. The base64 string encoding can be more efficient on both the client and server.
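
For example, the vector [1, 2, 3] packed as little-endian float32 and base64-encoded becomes "AACAPwAAAEAAAEBA", so the two request fragments below describe the same document (a sketch using the column-based form):

Example: {"id": [1], "vector": [[1, 2, 3]]} is equivalent to {"id": [1], "vector": ["AACAPwAAAEAAAEBA"]}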

A namespace can be created without vectors. In this case, the vector key must be omitted from all write requests.

Schema

turbopuffer maintains a schema for each namespace with type and indexing behavior for each attribute. By default, types are automatically inferred from the passed data and every attribute is indexed. To inspect the schema, use the metadata endpoint.

To customize indexing behavior or to specify types that cannot be automatically inferred (e.g. uuid), you can pass a schema object in a write request. This can be done on every write, or only the first; there's no performance difference. If a new attribute is added, this attribute will default to null for any documents that existed before the attribute was added.

Changing the attribute type of an existing attribute is currently an error.

For an example, see Configuring the schema.

type string (required)

The data type of the attribute. Supported types:

  • string: String
  • int: Signed integer (i64)
  • uint: Unsigned integer (u64)
  • float: Floating-point number (f64)
  • uuid: 128-bit UUID
  • datetime: Date and time
  • bool: Boolean
  • []string: Array of strings
  • []int: Array of signed integers
  • []uint: Array of unsigned integers
  • []float: Array of floating-point numbers
  • []uuid: Array of UUIDs
  • []datetime: Array of dates and times

All attributes are nullable, except for id.

string, int, and bool types and their array variants can be inferred from the write payload. Other types, such as uint or uuid, must be set explicitly in the schema. See UUID values for an example.

datetime values should be provided as an ISO 8601 formatted string with a mandatory date and optional time and time zone. Internally, these values are converted to UTC (if the time zone is specified) and stored as a 64-bit integer representing milliseconds since the epoch.

Example: ["2015-01-20", "2015-01-20T12:34:56", "2015-01-20T12:34:56-04:00"]


filterable boolean (default: true; false if full-text search or regex is enabled)

Whether or not the attribute can be used in filters/WHERE clauses. Filterable attributes are indexed into an inverted index. At query time, filter evaluation is recall-aware when used with vector queries.

Unfilterable attributes don't have an index built for them, and are thus billed at a 50% discount (see Pricing).


regex boolean (default: false)

Whether to enable Regex filters on this attribute. If set, filterable defaults to false; you can override this by setting filterable: true.


full_text_search boolean | object (default: false)

Whether this attribute can be used as part of a BM25 full-text search. Requires the string or []string type, and by default, BM25-enabled attributes are not filterable. You can override this by setting filterable: true.

Can either be a boolean for default settings, or an object with the following optional fields:

  • language (string): The language of the text. Defaults to english. See: Supported languages
  • stemming (boolean): Language-specific stemming for the text. Defaults to false (i.e. do not stem).
  • remove_stopwords (boolean): Removes common words from the text based on language. Defaults to true (i.e. remove common words).
  • case_sensitive (boolean): Whether searching is case-sensitive. Defaults to false (i.e. case-insensitive).
  • tokenizer (string): How to convert the text to a list of tokens. Defaults to word_v1. See: Supported tokenizers
  • k1 (float): Term frequency saturation parameter for BM25 scoring. Must be greater than zero. Defaults to 1.2. See: Advanced tuning
  • b (float): Document length normalization parameter for BM25 scoring. Must be in the range [0.0, 1.0]. Defaults to 0.75. See: Advanced tuning

If you require other types of full-text search options, please contact us.
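
A sketch of a schema entry using the object form with non-default settings (attribute name illustrative):

Example: {"text": {"type": "string", "full_text_search": {"language": "english", "stemming": true, "remove_stopwords": true, "tokenizer": "word_v1"}}}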


vector object (default: {'type': [dims]f32, 'ann': true})

Configures the vector attribute: the number of dimensions, whether elements are f16 or f32, and whether an ANN index is built (ann).

To use f16 vectors, this field needs to be explicitly specified in the schema when first creating (i.e. writing to) a namespace.

Example: "vector": {"type": [512]f16, "ann": true}

Updating attributes

We support online, in-place changes to the filterable and full_text_search settings for an attribute. The write does not need to include any documents, i.e. a body of just {"schema": ...} is supported, provided the namespace already exists.
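
For instance, a document-free write body that enables full-text search on an existing string attribute (attribute name illustrative) might look like:

Example: {"schema": {"title": {"type": "string", "full_text_search": true}}}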

Other index settings changes, attribute type changes, and attribute deletions currently cannot be done in-place. Consider exporting documents and upserting into a new namespace if you require a schema change.

After enabling the filterable or full_text_search setting for an existing attribute, the index needs time to build before queries that depend on the index can be executed. turbopuffer will respond with HTTP status 202 to queries that depend on an index that is not yet built.

Changing full-text search parameters also requires that the index be rebuilt. turbopuffer will do this automatically in the background, during which time queries will continue returning results using the previous full-text search settings.

Examples

Row-based writes

Row-based writes may be more convenient than column-based writes. You can pass any combination of upsert_rows, patch_rows, deletes, and delete_by_filter to the write request.

If the same document ID appears multiple times in the request, the request will fail with an HTTP 400 error.
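
A sketch of a request body combining these operations (IDs, vectors, and attribute values are illustrative):

  {
    "distance_metric": "cosine_distance",
    "upsert_rows": [
      {"id": 1, "vector": [0.1, 0.2], "name": "foo"},
      {"id": 2, "vector": [0.3, 0.4], "name": "bar"}
    ],
    "patch_rows": [{"id": 3, "name": "baz"}],
    "deletes": [4]
  }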

Configuring the schema

The schema can be passed on writes to manually configure attribute types and indexing behavior. A few examples where manually configuring the schema is needed:

  1. UUID values serialized as strings can be stored in turbopuffer in an optimized format.
  2. Enabling full-text search or regex indexing for string attributes.
  3. Disabling indexing/filtering (filterable: false) on an attribute, for a 50% discount and improved indexing performance.

An example of (1), (2), and (3):
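
A sketch of such a request body, reusing the schema shown earlier (attribute names and values are illustrative):

  {
    "upsert_rows": [
      {
        "id": 1,
        "permissions": ["00000000-0000-0000-0000-000000000001"],
        "text": "fox jumps over the lazy dog",
        "encrypted_blob": "opaque-payload"
      }
    ],
    "schema": {
      "permissions": "[]uuid",
      "text": {"type": "string", "full_text_search": true},
      "encrypted_blob": {"type": "string", "filterable": false}
    }
  }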

Column-based writes

Bulk document operations should use a column-oriented layout for best performance. You can pass any combination of upsert_columns, patch_columns, deletes, and delete_by_filter to the write request.

If the same document ID appears multiple times in the request, the request will fail with an HTTP 400 error.
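
A sketch of a column-based request body (IDs, vectors, and attribute values are illustrative):

  {
    "distance_metric": "cosine_distance",
    "upsert_columns": {
      "id": [1, 2],
      "vector": [[0.1, 0.2], [0.3, 0.4]],
      "name": ["foo", "bar"]
    },
    "deletes": [3]
  }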

Conditional writes

To make writes conditional, use the upsert_condition, patch_condition, and delete_condition parameters. These let you specify a condition that must be satisfied for the write to each individual document to proceed.

Conditions are evaluated before each write, using the current value of the document with the matching ID.

  • If the document exists and the condition is met, the write is applied.
  • If the document exists and the condition is not met, the write is skipped.
  • If the document does not exist, the write is applied unconditionally for upserts and skipped unconditionally for patches and deletes.

The operation will return the actual number of documents written (upserted, patched, or deleted).

Internally, the operation performs a query (one per batch) to determine which documents match the condition, so it is billed as both a query and a write operation. However, if the condition is not met for a given document, that write is skipped and not billed.

The condition syntax matches the filters parameter in the query API, with an additional feature: you can reference the new value being written using $ref_new references. These look like {"$ref_new": "attr_123"} and can be used in place of value literals. This allows the condition to vary by document in a multi-document write request.

Conditional deletes are distinct from delete_by_filter, which should be used when the set of IDs to conditionally delete is not known ahead of time.
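
Putting it together, a sketch of a last-write-wins upsert guarded by an updated_at timestamp (attribute names and values are illustrative; datetime is declared in the schema since it can't be inferred):

  {
    "distance_metric": "cosine_distance",
    "upsert_rows": [
      {"id": 1, "vector": [0.1, 0.2], "updated_at": "2025-01-20T12:34:56"}
    ],
    "upsert_condition": ["Or", [
      ["updated_at", "Lt", {"$ref_new": "updated_at"}],
      ["updated_at", "Eq", null]
    ]],
    "schema": {"updated_at": "datetime"}
  }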

Delete by filter

To delete documents that match a filter, use delete_by_filter. This operation will return the actual number of documents removed.

Because the operation internally issues a query to determine which documents to delete, this operation is billed as both a query and a write operation.

If delete_by_filter is used in the same request as other write operations, delete_by_filter will be applied before the other operations. This allows you to delete rows that match a filter before writing new rows with overlapping IDs. Note that patches to any deleted rows are ignored.

delete_by_filter has the same syntax as the filters parameter in the query API.
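
A minimal request body using this operation (filter values illustrative):

  {"delete_by_filter": ["page_id", "Eq", 123]}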
