Upsert documents

POST /v1/vectors/:namespace

Creates, updates, or deletes documents.

Latency

Upsert latency (500kb docs)

Percentile    Latency
p50           285ms
p90           370ms
p99           688ms
max           1250ms

Writes are consistent and thus immediately visible to queries.

The :namespace parameter identifies a set of documents. Within a namespace, documents are uniquely referred to by their ID. Upserting a document will overwrite any existing document with the same ID.

Namespaces are created when the first document is inserted.

For performance, we recommend creating a namespace per isolated document space instead of filtering when possible.

Each upsert request can carry a payload of up to 256 MB. For maximum throughput, we recommend writing in large batches, which amortizes the latency of writing to object storage.
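As a rough illustration of batching under the payload limit, the sketch below groups row-based documents into request bodies by their JSON-encoded size. The helper name batch_upserts is hypothetical, and reading "256 MB" as 256 * 1024 * 1024 bytes is an assumption; the size estimate is deliberately conservative.

```python
import json

# Assumption: "256 MB" is read here as 256 * 1024 * 1024 bytes.
MAX_PAYLOAD_BYTES = 256 * 1024 * 1024


def batch_upserts(rows, max_bytes=MAX_PAYLOAD_BYTES, distance_metric="cosine_distance"):
    """Yield row-based request bodies whose JSON encoding stays within max_bytes."""
    # Size of the request body with an empty upserts list.
    envelope = len(
        json.dumps({"distance_metric": distance_metric, "upserts": []}).encode("utf-8")
    )
    batch, size = [], envelope
    for row in rows:
        # +2 covers the ", " separator; this slightly overestimates the first row.
        row_bytes = len(json.dumps(row).encode("utf-8")) + 2
        if envelope + row_bytes > max_bytes:
            raise ValueError(f"document {row.get('id')!r} alone exceeds the payload limit")
        if batch and size + row_bytes > max_bytes:
            yield {"distance_metric": distance_metric, "upserts": batch}
            batch, size = [], envelope
        batch.append(row)
        size += row_bytes
    if batch:
        yield {"distance_metric": distance_metric, "upserts": batch}
```

Each yielded body can then be sent as one upsert request; larger batches mean fewer round trips to object storage.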

If this call returns OK, data is guaranteed to be durably written to object storage. You can read more about how upserts work on the Architecture page.

If low upsert latency is critical for you, contact us. It can be improved dramatically.

Warning: Queries may be slow during periods of high write throughput or after a large bulk import.

turbopuffer can handle >= 10,000 writes/s (WPS) per namespace, but indexing cannot currently keep up. This causes high query latency while performing bulk imports. When write throughput decreases (<= 100 per second) the indexer catches up, and queries will be fast.

Most use-cases do an initial bulk import, followed by queries with lower write throughput (<= 100 per second). This pattern is not affected once the indexer has caught up. We are actively working to remove this limitation.

Parameters

ids array, required unless upserts is set

Document IDs are stored as unsigned 64-bit integers or strings, depending on the type passed in the first request. Mixing ID types is not supported.


vectors array, required unless upserts is set

Must be the same length as the ids field. Each element is an array of numbers representing a vector. To delete a document, pass null in place of its vector.

Vector elements are stored as 32-bit floats. We intend to support several other formats.

Each vector in the namespace must have the same number of dimensions.
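The two vector constraints can be checked on the client before sending. This is a minimal sketch with hypothetical helper names (to_float32, check_vectors): the first shows what a value becomes once rounded to a 32-bit float, and the second verifies that all non-null vectors share one dimension count.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest 32-bit float, as a stored element would be."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def check_vectors(vectors, expected_dims=None):
    """Return the shared dimension count; raise ValueError if non-null vectors disagree."""
    dims = expected_dims
    for i, vec in enumerate(vectors):
        if vec is None:  # null marks a delete, so there is nothing to measure
            continue
        if dims is None:
            dims = len(vec)
        elif len(vec) != dims:
            raise ValueError(f"vector {i} has {len(vec)} dimensions, expected {dims}")
    return dims
```

Pass expected_dims when the namespace already exists and its dimension count is known, so a new batch is checked against it rather than only against itself.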


attributes object

Documents can optionally include attributes, which are used to filter search results. Attributes are key/value mappings. Keys are strings, and values can be strings, unsigned integers, or arrays of either. More value types will be added in the future.

This parameter is an object where the keys are the attribute names, and the values are arrays of attribute values. Each array must be the same length as the ids field. When a document doesn't have a value for a given attribute, pass null.

Attribute names id and vector are reserved, and an error will be returned if they are set.
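Application data is often held one document at a time, so a small conversion step is handy for producing the column layout described above. The sketch below is illustrative (rows_to_columns is a hypothetical helper): it pads missing attribute values with None, which serializes to JSON null, and rejects the reserved names.

```python
RESERVED = {"id", "vector"}  # attribute names the text above says are rejected


def rows_to_columns(rows, distance_metric="cosine_distance"):
    """Convert a list of {"id", "vector", "attributes"} dicts into a column-based body."""
    names = []
    for row in rows:
        for name in row.get("attributes", {}):
            if name in RESERVED:
                raise ValueError(f"attribute name {name!r} is reserved")
            if name not in names:
                names.append(name)  # keep first-seen order for stable output
    return {
        "ids": [row["id"] for row in rows],
        "vectors": [row.get("vector") for row in rows],
        "attributes": {
            # None fills the slot for documents that lack the attribute
            name: [row.get("attributes", {}).get(name) for row in rows]
            for name in names
        },
        "distance_metric": distance_metric,
    }
```

Every attribute array produced this way has the same length as ids, which is the invariant the column format requires.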


distance_metric string, required

The function used to calculate vector similarity. Possible values are cosine_distance or euclidean_squared.


schema object

By default, the schema is inferred from the passed data. You may want to explicitly specify the schema to configure the indexing behavior. This is currently only required for BM25 full text search.

See Schema below for details.


upserts array

Instead of specifying the upserts in a column-based format, you can use this optional param to specify them in a row-based format, if that's more convenient (there's no difference in behavior).

Each upsert in this list should specify an id, and optionally specify a vector and attributes, as defined above. If vector is not provided, or has value null, the operation is considered a delete.

Attribute keys must have consistent value types. For example, if a document is upserted containing attribute key foo with a string value, all future documents that specify foo must also use a string value. We're actively working on tooling to support value type migrations (and overall schema management).
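Since a type mismatch is only discovered when the request is made, a client-side check can catch it earlier. The following is a sketch under assumptions: the type labels ("string", "uint", "[]string") are illustrative and not names used by the service, and treating an array that mixes strings and integers as invalid is a guess at the intended rule.

```python
def value_type(value):
    """Label an attribute value. The labels are illustrative only."""
    if isinstance(value, str):
        return "string"
    if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
        return "uint"
    if isinstance(value, list):
        if any(isinstance(v, list) for v in value):
            raise TypeError("nested arrays are not among the listed value types")
        kinds = {value_type(v) for v in value}
        if len(kinds) > 1:
            raise TypeError("array mixes strings and unsigned integers")
        return "[]" + kinds.pop() if kinds else None  # empty array: element type unknown
    raise TypeError(f"unsupported attribute value: {value!r}")


def check_attribute_types(rows, known=None):
    """Return the attribute -> type mapping; raise TypeError on an inconsistent value."""
    known = dict(known or {})
    for row in rows:
        for name, value in row.get("attributes", {}).items():
            if value is None:
                continue  # a missing value carries no type
            kind = value_type(value)
            if kind is None:
                continue
            if known.setdefault(name, kind) != kind:
                raise TypeError(f"attribute {name!r} was {known[name]}, got {kind}")
    return known
```

Passing the mapping returned by an earlier call as known carries the check across batches.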

Schema

The schema is used to configure type and indexing behavior. It is optional and by default is automatically inferred from the passed data. At the moment, it is only required for specifying BM25 searchable fields. You must specify that the field is BM25 searchable by setting a bm25 key in the schema on the first upsert to the namespace.

Schema will soon be used to specify unindexed fields for cost savings, types that cannot be trivially inferred from the data, and customizing indexing behavior.

{
  "ids": [1, 2, 3],
  "vectors": [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]],
  "attributes": {
    "text": ["the fox is quick and brown", "fox jumped over the lazy dog", "the dog is lazy and brown"],
    "more-text": ["hello", "you", "cool"],
    "string": ["fox", "fox", "dog"]
  },
  "distance_metric": "cosine_distance",
  "schema": {
    "text": {
      "type": "?string",
      "bm25": {
        /* required for stemming & stopword removal. defaults to english */
        "language": "english",
        /* language-specific stemming helps correct for plurals, e.g. 'walrus' and 'walruses' are both searchable as 'walrus' */
        "stemming": false,
        /* removes common words like 'the', 'and', 'a', etc. */
        "remove_stopwords": true,
        /* if case sensitivity is enabled, stemming & stopword removal are no longer allowed */
        "case_sensitive": false
      }
    },
    "more-text": {
      "type": "?string",
      "bm25": true // shorthand for default settings shown above
    }
  }
}
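The comments in the example imply one constraint between the BM25 options: case sensitivity cannot be combined with stemming or stopword removal. A hypothetical helper (bm25_field is not part of any client library) could build a schema entry and enforce that rule locally; its defaults mirror the values shown in the example.

```python
def bm25_field(language="english", stemming=False, remove_stopwords=True, case_sensitive=False):
    """Build a schema entry for a BM25-searchable string field."""
    if case_sensitive and (stemming or remove_stopwords):
        raise ValueError("case_sensitive cannot be combined with stemming or stopword removal")
    return {
        "type": "?string",
        "bm25": {
            "language": language,
            "stemming": stemming,
            "remove_stopwords": remove_stopwords,
            "case_sensitive": case_sensitive,
        },
    }


# Schema for the two text fields in the example above.
schema = {"text": bm25_field(), "more-text": bm25_field()}
```

Because the BM25 key must be set on the first upsert to the namespace, building the schema up front keeps it next to the code that creates the namespace.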

Examples

Document Update or Insert

Bulk document operations use a column-oriented layout for ids, vectors, and attributes.

// Request payload
{
  "ids": [1, 2, 3, 4],
  "vectors": [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]],
  "attributes": {
    "my-string": ["one", null, "three", "four"],
    "my-uint": [12, null, 84, 39],
    "my-string-array": [["a", "b"], ["b", "d"], [], ["c"]]
  },
  "distance_metric": "cosine_distance"
}

// Response payload
{
  "status": "OK"
}
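For reference, the request above could be issued from Python with only the standard library. This is a sketch, not an official client: the base URL and the bearer-token Authorization header are assumptions that this page does not state, and build_upsert_request and send are hypothetical names.

```python
import json
import urllib.parse
import urllib.request

# Assumptions, not taken from this page: the base URL and the bearer-token header.
BASE_URL = "https://api.turbopuffer.com"


def build_upsert_request(namespace, body, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/vectors/:namespace."""
    path = "/v1/vectors/" + urllib.parse.quote(namespace, safe="")
    return urllib.request.Request(
        url=base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping construction separate from sending makes the payload easy to inspect or log before any network call is made.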

Document Deletion

Documents can be deleted by upserting their IDs with a null vector.

// Request payload
{
  "ids": [2, 3],
  "vectors": [null, null]
}

// Response payload
{
  "status": "OK"
}

Documents that match a specific filter can be deleted by combining the upsert endpoint with filter-only queries: paginate over all the IDs matching the filter, then delete those IDs.
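The loop below sketches one way to structure this. Both callables are hypothetical stand-ins: fetch_page wraps a filter-only query and returns up to limit matching IDs, and upsert sends a column-based body to this endpoint. Because writes are immediately visible to queries, the sketch re-runs the same query after each delete instead of tracking a cursor; that design choice is an assumption, not a documented recipe.

```python
def delete_matching(fetch_page, upsert, page_size=1000):
    """Delete every document returned by fetch_page, one page at a time.

    fetch_page(limit) -> list of matching IDs (hypothetical filter-only query)
    upsert(body)      -> sends a column-based upsert request (hypothetical)
    """
    deleted = 0
    while True:
        ids = fetch_page(page_size)
        if not ids:
            return deleted
        # A null vector for each ID marks that document for deletion.
        upsert({"ids": ids, "vectors": [None] * len(ids)})
        deleted += len(ids)
```

If a delete were to fail without raising, this loop would keep fetching the same page, so a real implementation would want a retry limit.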

Row-based API

The upsert operations can also be specified in the following format.

// Request payload
{
  "distance_metric": "cosine_distance",
  "upserts": [
    {
      "id": 1,
      "vector": [0.1, 0.1],
      "attributes": {
          "my-string": "one",
          "my-uint": 12,
          "my-string-array": ["a", "b"]
      }
    },
    {
      "id": 2,
      "vector": [0.2, 0.2],
      "attributes": {
          "my-string-array": ["b", "d"]
      }
    },
    {
      "id": 3,
      "vector": [0.3, 0.3],
      "attributes": {
          "my-string": "three",
          "my-uint": 84
      }
    },
    {
      "id": 4,
      "vector": [0.4, 0.4],
      "attributes": {
          "my-string": "four",
          "my-uint": 39,
          "my-string-array": ["c"]
      }
    }
  ]
}

// Response payload
{
  "status": "OK"
}
