Upsert documents

POST /v1/vectors/:namespace

Creates, updates, or deletes documents.

Latency

Upsert latency (500kb docs)

Percentile   Latency
p50          285ms
p90          370ms
p99          688ms
max          1250ms

Writes are consistent and thus immediately visible to queries.

The :namespace parameter identifies a set of documents. Within a namespace, documents are uniquely referred to by their ID. Upserting a document will overwrite any existing document with the same ID.

Namespaces are created when the first document is inserted.

For performance, we recommend creating a namespace per isolated document space instead of filtering when possible.

Each upsert request can carry a payload of up to 256 MB. For maximum throughput, we recommend writing in large batches, which amortizes the latency of writing to object storage.
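As a rough illustration of batching under the payload limit, the sketch below groups row-based documents into request payloads whose JSON encoding stays within a byte budget. The helper name batch_upserts is hypothetical, the 256 MB limit is read conservatively as 256,000,000 bytes, and the size estimate only holds if the payload is later serialized with json.dumps and its default separators.

```python
import json

# Assumption: "256 MB" is treated conservatively as 256 * 10^6 bytes.
MAX_PAYLOAD_BYTES = 256 * 1000 * 1000


def _envelope(distance_metric, upserts):
    """Wrap a list of rows in the row-based request format."""
    return {"distance_metric": distance_metric, "upserts": upserts}


def batch_upserts(rows, distance_metric="cosine_distance", max_bytes=MAX_PAYLOAD_BYTES):
    """Yield row-based payloads whose json.dumps encoding fits within max_bytes.

    Hypothetical helper, not part of any official client.
    """
    # Size of the payload with an empty "upserts" list.
    overhead = len(json.dumps(_envelope(distance_metric, [])).encode("utf-8"))
    batch, size = [], overhead
    for row in rows:
        # +2 accounts for the ", " separator json.dumps puts between rows
        # (slightly overestimates for the first row, which is harmless).
        row_size = len(json.dumps(row).encode("utf-8")) + 2
        if overhead + row_size > max_bytes:
            raise ValueError("a single document exceeds the payload limit")
        if batch and size + row_size > max_bytes:
            yield _envelope(distance_metric, batch)
            batch, size = [], overhead
        batch.append(row)
        size += row_size
    if batch:
        yield _envelope(distance_metric, batch)
```

Each yielded payload can then be sent as one upsert request; larger batches mean fewer round trips to object storage.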

If this call returns OK, data is guaranteed to be durably written to object storage. You can read more about how upserts work on the Architecture page.
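A minimal sketch of issuing the call with Python's standard library might look as follows. Only the POST /v1/vectors/:namespace path and the {"status": "OK"} response come from this page; the base URL is left as a caller-supplied parameter, and the bearer-token Authorization header is an assumption that should be checked against the authentication documentation.

```python
import json
import urllib.parse
import urllib.request


def build_upsert_request(base_url, namespace, payload, api_key):
    """Build (but do not send) a POST request to /v1/vectors/:namespace."""
    url = "{}/v1/vectors/{}".format(
        base_url.rstrip("/"), urllib.parse.quote(namespace, safe="")
    )
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # ASSUMPTION: bearer-token auth; not stated on this page.
            "Authorization": "Bearer {}".format(api_key),
        },
        method="POST",
    )


def upsert(base_url, namespace, payload, api_key):
    """Send the upsert and fail loudly unless the response status is OK."""
    request = build_upsert_request(base_url, namespace, payload, api_key)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read())
    if body.get("status") != "OK":
        raise RuntimeError("upsert was not acknowledged: {!r}".format(body))
    return body
```

Treating anything other than an OK status as an error matters here because the durability guarantee applies only once the call returns OK.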

If low latency on upserts is a critical blocker for you, it can be improved dramatically. Just contact us.

Parameters

ids array, required unless upserts or copy_from_namespace is set

Document IDs are stored as unsigned 64-bit integers, 128-bit UUIDs, or strings, depending on what's passed in the first request. Mixing ID types is not supported.

UUIDs serialize as a string, and require passing uuid as the id type in the schema of the first upsert to tell turbopuffer to parse the UUID and store/index it in an optimized format.
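Because mixing ID types is not supported, it can help to check a batch on the client before sending it. The sketch below is a hypothetical pre-flight check (check_ids is not part of any official client). It can only tell integers from strings, since UUIDs travel as strings and are distinguished by the schema rather than by the data.

```python
def id_kind(value):
    """Classify a single ID as 'uint' or 'string'; reject anything else."""
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(value, bool):
        raise TypeError("booleans are not valid document IDs")
    if isinstance(value, int):
        if not 0 <= value < 2**64:
            raise ValueError("integer ID out of unsigned 64-bit range: %r" % value)
        return "uint"
    if isinstance(value, str):
        # Covers both plain strings and UUIDs serialized as strings.
        return "string"
    raise TypeError("unsupported ID type: %s" % type(value).__name__)


def check_ids(ids):
    """Return the single ID kind used in a batch, or None for an empty batch."""
    kinds = {id_kind(value) for value in ids}
    if len(kinds) > 1:
        raise ValueError("mixed ID types in one batch: %s" % sorted(kinds))
    return kinds.pop() if kinds else None
```

This only covers a single request; consistency with the type established by the namespace's first request still has to be enforced by the server.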


vectors array, required unless upserts or copy_from_namespace is set

Must be the same length as the ids field. Each element is an array of numbers representing a vector. To delete a document, pass null at the position in the vectors field that corresponds to its ID.

Vector elements are stored as 32-bit floats. We intend to support several other formats.

Each vector in the namespace must have the same number of dimensions.
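The constraints above can be checked locally before a request is sent. The following sketch uses hypothetical helper names: it verifies that ids and vectors line up and that all non-null vectors share one dimensionality, and it shows how a Python float changes when narrowed to 32 bits. It cannot see what is already stored in the namespace, so the server remains the final authority.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def validate_columns(ids, vectors):
    """Check column lengths and vector dimensions; return the dimension.

    None entries in vectors mark deletions and are skipped.
    Returns None if the batch contains only deletions.
    """
    if len(ids) != len(vectors):
        raise ValueError(
            "ids has %d entries but vectors has %d" % (len(ids), len(vectors))
        )
    dims = {len(v) for v in vectors if v is not None}
    if len(dims) > 1:
        raise ValueError("vectors have differing dimensions: %s" % sorted(dims))
    return dims.pop() if dims else None
```

Values such as 0.5 survive the narrowing exactly, while 0.1 picks up a small rounding error, which is worth remembering when comparing stored vectors against their originals.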


attributes object

Documents can optionally include attributes, which are used to filter search results. Attributes are key/value mappings, where keys are strings, and values are a supported type.

This parameter is an object where the keys are the attribute names, and the values are arrays of attribute values. Each array must be the same length as the ids field. When a document doesn't have a value for a given attribute, pass null.

Attribute names id and vector are reserved, and an error will be returned if they are set.
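To make the column layout concrete, here is a small sketch that turns a list of per-document dictionaries into the ids, vectors, and attributes columns, filling in null (None in Python) wherever a document lacks an attribute. The function name to_columns and the input shape are illustrative assumptions, not part of the API.

```python
RESERVED_ATTRIBUTE_NAMES = {"id", "vector"}


def to_columns(rows):
    """Convert [{'id', 'vector', 'attributes'}, ...] into column-based fields."""
    # Collect attribute names in first-seen order so output is deterministic.
    names = []
    for row in rows:
        for name in row.get("attributes", {}):
            if name in RESERVED_ATTRIBUTE_NAMES:
                raise ValueError("attribute name %r is reserved" % name)
            if name not in names:
                names.append(name)
    return {
        "ids": [row["id"] for row in rows],
        # A missing vector becomes None, which serializes to JSON null.
        "vectors": [row.get("vector") for row in rows],
        # Every attribute column has one entry per document, None when absent.
        "attributes": {
            name: [row.get("attributes", {}).get(name) for row in rows]
            for name in names
        },
    }
```

Serializing the result with json.dumps produces arrays of equal length for every column, with null in the gaps.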


distance_metric string, required unless copy_from_namespace is set

The function used to calculate vector similarity. Possible values are cosine_distance or euclidean_squared.


copy_from_namespace string

Copy all documents from a namespace into this namespace. This operation is currently limited to copying within the same region and organization. The initial request currently cannot make schema changes or contain documents. Contact us if you need any of this.

Copying is billed at a 50% write discount which stacks with the up to 50% discount for batched writes. This is a faster, cheaper alternative to re-upserting documents for backups and namespaces that share documents.


schema object

By default, the schema is inferred from the passed data. See Defining the Schema below for details.


upserts array

Instead of specifying the upserts in a column-based format, you can use this optional parameter to specify them in a row-based format if that is more convenient. There is no difference in behavior.

Each upsert in this list should specify an id, and optionally specify a vector and attributes, as defined above. If vector is not provided, or has value null, the operation is considered a delete.

Attribute keys must have consistent value types. For example, if a document is upserted containing attribute key foo with a string value, all future documents that specify foo must also use a string value. We're actively working on tooling to support value type migrations (and overall schema management).

Defining the Schema

The schema is used to configure type and indexing behavior. Setting the schema field is optional; by default, types are automatically inferred from the passed data. Manually defining the schema is needed for:

  • UUID values, which serialize as strings but are stored in turbopuffer in an optimized format
  • Enabling BM25 full-text search for a string attribute
  • Disabling indexing for an attribute (for cost savings)

The format of the schema object is described in the schema documentation.

To enable BM25 full-text search over a field, you must set the full_text_search key in the schema on the first upsert to the namespace.

{
  "ids": [1, 2, 3, 4],
  "vectors": [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]],
  "attributes": {
    "text": ["the fox is quick and brown", "fox jumped over the lazy dog", "the dog is lazy and brown", "the dog is a fox"],
    "more-text": ["hello", "you", "cool", "walrus"],
    "string": ["fox", "fox", "dog", "narwhal"]
  },
  "distance_metric": "cosine_distance",
  "schema": {
    "text": {
      "type": "string",
      "full_text_search": {
        /* required for stemming & stopword removal. defaults to english */
        "language": "english",
        /* language-specific stemming helps correct for plurals, e.g. 'walrus' and 'walruses' are both searchable as 'walrus' */
        "stemming": false,
        /* removes common words like 'the', 'and', 'a', etc. */
        "remove_stopwords": true,
        /* if case sensitivity is enabled, stemming & stopword removal are no longer allowed */
        "case_sensitive": false
      }
    },
    "more-text": {
      "type": "string",
      "full_text_search": true, // shorthand for the default settings shown above
      "filterable": false // don't index this attribute for filtering
    }
  }
}

You can mark any attribute as "filterable": false in the schema if you don't want it to be indexed for filtering. This reduces the cost of the attribute by 50% (see pricing), but means you can't filter on it.

Examples

Document Update or Insert

Bulk document operations use a column-oriented layout: ids, vectors, and each attribute are passed as parallel arrays, with one entry per document.

// Request payload
{
  "ids": [1, 2, 3, 4],
  "vectors": [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]],
  "attributes": {
    "my-string": ["one", null, "three", "four"],
    "my-uint": [12, null, 84, 39],
    "my-bool": [true, null, false, true],
    "my-string-array": [["a", "b"], ["b", "d"], [], ["c"]]
  },
  "distance_metric": "cosine_distance"
}

// Response payload
{
  "status": "OK"
}

Document Deletion

Documents can be deleted by upserting their IDs with a null vector.

// Request payload
{
  "ids": [2, 3],
  "vectors": [null, null]
}

// Response payload
{
  "status": "OK"
}

To delete all documents that match a specific filter, combine the upsert endpoint with filter-only queries: paginate over the IDs matching the filter, then delete those IDs.
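The paginate-then-delete pattern can be sketched as a generator that turns pages of matching IDs into delete payloads. The query endpoint is not described on this page, so fetch_page is a hypothetical callable supplied by the caller, and the cursor-based pagination it implies is an assumption made purely for illustration.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# (ids on this page, cursor for the next page or None when finished)
Page = Tuple[List[int], Optional[str]]


def delete_payloads(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield one column-based delete payload per page of matching IDs.

    fetch_page is a HYPOTHETICAL callable wrapping a filter-only query;
    it receives the previous cursor (None on the first call).
    """
    cursor = None
    while True:
        ids, cursor = fetch_page(cursor)
        if ids:
            # A null vector for an ID marks that document for deletion.
            yield {"ids": ids, "vectors": [None] * len(ids)}
        if cursor is None:
            break
```

Each yielded payload has the same shape as the deletion request shown above and can be posted to the upsert endpoint as is.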

Row-based API

The upsert operations can also be specified in the following format.

// Request payload
{
  "distance_metric": "cosine_distance",
  "upserts": [
    {
      "id": 1,
      "vector": [0.1, 0.1],
      "attributes": {
          "my-string": "one",
          "my-uint": 12,
          "my-bool": true,
          "my-string-array": ["a", "b"]
      }
    },
    {
      "id": 2,
      "vector": [0.2, 0.2],
      "attributes": {
          "my-string-array": ["b", "d"]
      }
    },
    {
      "id": 3,
      "vector": [0.3, 0.3],
      "attributes": {
          "my-string": "three",
          "my-uint": 84
      }
    },
    {
      "id": 4,
      "vector": [0.4, 0.4],
      "attributes": {
          "my-string": "four",
          "my-uint": 39,
          "my-string-array": ["c"]
      }
    }
  ]
}

// Response payload
{
  "status": "OK"
}

UUID values

UUIDs can be stored and indexed more efficiently by setting the attribute or ID type to uuid. Each UUID value is billed as 16 bytes, instead of 36 bytes for the typical string representation of a UUID.

Setting the type via the schema field is required to disambiguate between a UUID and a regular string value. UUID values can be parsed from any hex representation of a UUID, with optional hyphens, e.g.

  • 8d724c16-f9a5-4d99-84a8-006ddadea956
  • 8d724c16f9a54d9984a8006ddadea956
  • 8D724C16-F9A5-4D99-84A8-006DDADEA956

UUIDs returned in API responses are always formatted lowercase with hyphens.
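For client code that compares IDs it sent against IDs it gets back, normalizing to the lowercase hyphenated form first avoids spurious mismatches. The sketch below uses Python's standard uuid module; normalize_uuid is a hypothetical helper name. Note that uuid.UUID also accepts a few forms (such as braces or a urn:uuid: prefix) that this page does not list as accepted by the API.

```python
import uuid


def normalize_uuid(text):
    """Return the lowercase, hyphenated form of a hex UUID string."""
    # uuid.UUID accepts hex with or without hyphens, in either case;
    # str() always renders lowercase with hyphens.
    return str(uuid.UUID(text))


if __name__ == "__main__":
    forms = [
        "8d724c16-f9a5-4d99-84a8-006ddadea956",
        "8d724c16f9a54d9984a8006ddadea956",
        "8D724C16-F9A5-4D99-84A8-006DDADEA956",
    ]
    for form in forms:
        print(normalize_uuid(form))
    # The binary form is 16 bytes, versus 36 characters for the string form.
    print(len(uuid.UUID(forms[0]).bytes), len(normalize_uuid(forms[0])))
```

The 16-byte versus 36-character difference printed at the end is the same gap that the billing note above refers to.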

// Request payload
{
  "distance_metric": "cosine_distance",
  "upserts": [
    {
      "id": "8d724c16-f9a5-4d99-84a8-006ddadea956",
      "vector": [0.1, 0.1],
      "attributes": {
        "my-uuid": "8342a8bb-bdc3-453b-b4f3-95c16585f70c",
        "my-uuid-array": ["e0b91c99-b462-4b5c-8345-0925726100e2", "c3f5c19e-77a7-42e8-a64e-8b1e3fb03522"]
      }
    }
  ],
  "schema": {
    "id": "uuid",
    "my-uuid": "uuid",
    "my-uuid-array": "[]uuid"
  }
}

// Response payload
{
  "status": "OK"
}

© 2024 turbopuffer Inc.