
Upsert & Delete Documents

POST /v1/vectors/:namespace

Creates, updates, or deletes documents.

Latency

Upsert latency (500 KB docs):

Percentile  Latency
p50         285ms
p90         370ms
p99         688ms

Writes are immediately visible to queries and are applied atomically. The :namespace is an isolated set of documents and is created when the first document is inserted.

Within a namespace, documents are uniquely referred to by their ID and conform to the namespace's schema. Upserting a document will overwrite any existing document with the same ID. The schema is automatically inferred, but can be configured to control type and indexing behavior.

For performance, we recommend creating a namespace per isolated document space instead of filtering when possible. See Performance.

Each upsert request can have a maximum payload size of 256 MB. For maximum throughput, write in large batches to amortize the latency of writing to object storage.

If this endpoint returns OK, data is guaranteed to be durably written to object storage. You can read more about how upserts work on the Architecture page.
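A minimal sketch of building the HTTP request with the Python standard library. The base URL and Bearer auth header shape are assumptions here, so prefer an official client library where available:

```python
import json
import urllib.request

def build_upsert_request(namespace, payload, api_key,
                         base_url="https://api.turbopuffer.com"):
    """Build a POST /v1/vectors/:namespace request (not yet sent)."""
    return urllib.request.Request(
        f"{base_url}/v1/vectors/{namespace}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Auth scheme is an assumption; check your account's API docs.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# urllib.request.urlopen(build_upsert_request(...)) would perform the write;
# a 200 response with {"status": "OK"} means the batch is durable.
```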

Upserts can be columnar by using ids, vectors, and attributes or row-based by using upserts.

Deletes are performed by sending a null vector.

Attributes must have consistent value types. For example, if a document is upserted containing attribute key foo with a string value, all future documents that specify foo must also use a string value (or null).

Parameters

ids array (required unless {upserts, copy_from_namespace} is set)

Document IDs are inferred as unsigned 64-bit integers, 128-bit UUIDs, or strings on the first upsert. Mixing ID types is not supported.

UUIDs serialize as a string, and require passing uuid as the id type in the Schema of the first upsert to tell turbopuffer to parse the UUID and store it in an optimized format.

Example: [1, 2, 3]


vectors array<array<f32>> | array<f32> (required unless {upserts, copy_from_namespace} is set)

Must be a nested array of the same length as the ids field, or a flat array of length len(ids) * vector_dimensions.

To delete one or more vectors, pass null in the vectors field.

Each vector in the namespace must have the same number of dimensions.

Example: [[1, 2, 3], [4, 5, 6], null]
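The two layouts carry the same data. A sketch of flattening a nested list into the flat form (note that the flat form has no per-document slot for null, so the delete examples below use the nested form):

```python
ids = [1, 2, 3]
nested = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one 2-dim vector per id

# Flat layout: the same floats, row-major, len(ids) * dimensions long.
flat = [x for vec in nested for x in vec]
assert len(flat) == len(ids) * len(nested[0])
assert flat == [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```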


distance_metric cosine_distance | euclidean_squared (required unless copy_from_namespace is set)

The function used to calculate vector similarity. Possible values are cosine_distance or euclidean_squared.

cosine_distance is defined as 1 - cosine_similarity and ranges from 0 to 2. Lower is better.

euclidean_squared is defined as sum((x - y)^2). Lower is better.
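Both definitions are simple to state directly. A reference sketch in plain Python (not turbopuffer's internal implementation):

```python
import math

def cosine_distance(x, y):
    # 1 - cosine_similarity; 0 for identical direction, up to 2 for opposite.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def euclidean_squared(x, y):
    # sum((x - y)^2); the square root is omitted, which preserves ordering.
    return sum((a - b) ** 2 for a, b in zip(x, y))

assert cosine_distance([1, 0], [1, 0]) == 0.0   # same direction
assert cosine_distance([1, 0], [-1, 0]) == 2.0  # opposite direction
assert euclidean_squared([1, 2], [4, 6]) == 25  # 3^2 + 4^2
```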


attributes object

Documents can optionally include attributes, which are used to filter search results.

Attributes are key/value mappings, where keys are strings, and values are a supported type. Value types are inferred on upserts.

This parameter is an object where the keys are the attribute names, and the values are arrays of attribute values. Each array must be the same length as the ids field.

When a document doesn't have a value for a given attribute, pass null.

If a new attribute is added, the new attribute will default to null for past documents.

Some limits apply to attribute sizes and number of attribute names per namespace. See Limits.

Example: {"color": [null, "red", "blue"], "size": [10, 20, null]}
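If your documents are naturally row-oriented, the columnar attributes object can be assembled like this (a sketch; the attribute names are just examples):

```python
docs = [
    {"id": 1, "attributes": {"color": None, "size": 10}},
    {"id": 2, "attributes": {"color": "red", "size": 20}},
    {"id": 3, "attributes": {"color": "blue"}},  # no "size": becomes null
]

# One array per attribute name, aligned with ids; missing values become null.
keys = {k for d in docs for k in d["attributes"]}
attributes = {k: [d["attributes"].get(k) for d in docs] for k in sorted(keys)}

assert attributes == {
    "color": [None, "red", "blue"],
    "size": [10, 20, None],
}
```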


copy_from_namespace string

Copy all documents from a namespace into this namespace. This operation is currently limited to copying within the same region and organization. The initial request currently cannot make schema changes or contain documents. Contact us if you need any of this.

Copying is billed at a 50% write discount, which stacks with the up-to-50% discount for batched writes. This is a faster, cheaper alternative to re-upserting documents, useful for backups and for namespaces that share a common set of documents.

Example: "source-namespace"


schema object

By default, the schema is inferred from the passed data. See Defining the Schema below for details.

There are cases where you want to manually specify the schema because turbopuffer can't automatically infer it. For example, to specify UUID types, configure full-text search for an attribute, or disable filtering for an attribute.

Example: {"permissions": "[]uuid", "text": {"type": "string", "full_text_search": true}, "encrypted_blob": {"type": "string", "filterable": false}}


upserts object

Instead of specifying the upserts in a column-based format, you can use this optional parameter to specify them in a row-based format if that's more convenient; there is no difference in behavior.

Each upsert in this list should specify an id, and optionally specify a vector and attributes, as defined above. If vector is not provided, or has value null, the operation is considered a delete.

Example: [{"id": "1", "vector": [1, 2, 3], "attributes": {"color": "red", "size": 10}}, {"id": "2", "attributes": {"color": "blue", "size": 20}}]

Examples

Update or Insert

Bulk document operations use a column-oriented layout for documents, ids, and attributes. See below for the row-based API. Each batch is applied atomically, i.e. you won't see partial query results for a batch.

// Request payload
{
  "ids": [1, 2, 3, 4],
  "vectors": [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]],
  "attributes": {
    "my-string": ["one", null, "three", "four"],
    "my-uint": [12, null, 84, 39],
    "my-bool": [true, null, false, true],
    "my-string-array": [["a", "b"], ["b", "d"], [], ["c"]]
  },
  "distance_metric": "cosine_distance"
}

// Response payload
{
  "status": "OK"
}

Delete

Documents can be deleted by upserting with the vector set to null. Since batches are applied atomically, if you delete a document and insert another in the same upsert, those operations will always be applied together.

// Request payload
{
  "ids": [2, 3],
  "vectors": [null, null]
}

// Response payload
{
  "status": "OK"
}

Delete by filter

To delete documents that match a filter, paginate over a query, and issue the deletes for each matching document:

ns = tpuf.Namespace("delete_by_query_example")

ns.upsert([
  {'id': 0, 'vector': [2, 2], 'attributes': {'timestamp': 2}},
  {'id': 1, 'vector': [1, 1], 'attributes': {'timestamp': 1}},
  {'id': 2, 'vector': [1, 2], 'attributes': {'timestamp': 0}},
], distance_metric='cosine_distance')

last_id = None
while True:
  results = ns.query(
      top_k=1000,
      filters=["And", [
        ["timestamp", "Gte", 1],
        ["id", "Gte" if last_id is None else "Gt", last_id or 0]
      ]]
  )
  if not results:
      break
  ns.delete([doc.id for doc in results])
  if len(results) < 1000:  # partial page: everything matching is now deleted
      break
  last_id = results[-1].id

assert [doc.id for doc in ns.vectors()] == [2]

Note that documents matching the filter but inserted between the query and the delete will not be deleted. If you need all the deletes to be applied atomically, collect all the IDs in a list before issuing a single delete.
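The atomic variant looks like the loop above, but defers the delete until all pages are collected. To keep this sketch runnable on its own, FakeNamespace below is a tiny in-memory stand-in for the client (not the real API); only the collect-then-single-delete pattern is the point:

```python
class FakeNamespace:
    """Minimal in-memory stand-in: id -> timestamp, query by filter."""
    def __init__(self, docs):
        self.docs = dict(docs)

    def query(self, top_k, filters):
        _, (ts_f, id_f) = filters          # ["And", [timestamp_filter, id_filter]]
        op, lo = id_f[1], id_f[2]
        matches = sorted(
            i for i, ts in self.docs.items()
            if ts >= ts_f[2] and (i >= lo if op == "Gte" else i > lo)
        )
        return matches[:top_k]

    def delete(self, ids):
        for i in ids:
            self.docs.pop(i, None)

ns = FakeNamespace({0: 2, 1: 1, 2: 0})     # id 2 has timestamp 0, kept below

# Paginate the query first, collecting ids without deleting anything yet.
all_ids, last_id = [], None
while True:
    results = ns.query(
        top_k=1000,
        filters=["And", [
            ["timestamp", "Gte", 1],
            ["id", "Gte" if last_id is None else "Gt", last_id or 0],
        ]],
    )
    if not results:
        break
    all_ids.extend(results)
    if len(results) < 1000:
        break
    last_id = results[-1]

# One batch: either every collected id is deleted, or none are.
ns.delete(all_ids)
assert sorted(ns.docs) == [2]
```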

Schema

The schema is optionally set on upsert to configure type and indexing behavior. By default, types are automatically inferred from the passed data and every attribute is indexed. You can always GET the schema.

See the schema documentation for all type and indexing options available. A few examples where manually configuring the schema is needed:

  1. UUID values serialized as strings can be stored in turbopuffer in an optimized format
  2. Full-text search for a string attribute
  3. Disabling indexing/filtering (filterable:false) for an attribute, for a 50% discount and improved indexing performance.

You can choose to pass the schema on every upsert, or only the first. There's no performance difference. If an upsert adds a new attribute, it will imply that all previous documents have a null value for that attribute.

An example of (1), (2), and (3) on upsert:

{
  "ids": ["769c134d-07b8-4225-954a-b6cc5ffc320c", "3ad8c7b2-9c49-4ae5-819a-e014aef5c1ba", "611ea878-ed54-462b-82f2-10e5bb6e2110", "793afe55-ff77-4c64-9b9f-d26afd9faebe"],
  "vectors": [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]],
  "attributes": {
    "text": ["the fox is quick and brown", "fox jumped over the lazy dog", "the dog is lazy and brown", "the dog is a fox"], // inferred as string, and filterable: true
    "string": ["fox", "fox", "dog", "narwhal"], // inferred as string, and filterable: true
    "permissions": [ // inferred as []uuid and filterable: true
       ["ee1f7c89-a3aa-43c1-8941-c987ee03e7bc", "95cdf8be-98a9-4061-8eeb-2702b6bbcb9e"],
       ["bfa20d1c-d8bc-4ec3-b2c3-d8b5d3e034e0"],
       ["ee1f7c89-a3aa-43c1-8941-c987ee03e7bc", "bfa20d1c-d8bc-4ec3-b2c3-d8b5d3e034e0", "ee1f7c89-a3aa-43c1-8941-c987ee03e7bc", "95cdf8be-98a9-4061-8eeb-2702b6bbcb9e"],
       ["95cdf8be-98a9-4061-8eeb-2702b6bbcb9e"],
     ]
  },
  "distance_metric": "cosine_distance",
  "schema": {
    "id": "uuid",
    "text": {
      "type": "string",
      "full_text_search": true // sets filterable: false, and enables FTS with default settings
    },
    // `string`: we are happy with the defaults!
    "permissions": {
      "type": "[]uuid" // otherwise inferred as the slower/more expensive []string
    }
  }
}

Row-based

As an alternative to the column-based API, you can specify the upserts in a row-based format:

// Request payload
{
  "distance_metric": "cosine_distance",
  "upserts": [
    {
      "id": 1,
      "vector": [0.1, 0.1],
      "attributes": {
          "my-string": "one",
          "my-uint": 12,
          "my-bool": true,
          "my-string-array": ["a", "b"]
      }
    },
    {
      "id": 2,
      "vector": [0.2, 0.2],
      "attributes": {
          "my-string-array": ["b", "d"]
      }
    }
  ]
}

// Response payload
{
  "status": "OK"
}

© 2024 turbopuffer Inc.