Permissions

When a namespace contains documents belonging to multiple users or groups, queries should only return documents the user has access to.

Permissions in turbopuffer currently have to be implemented at the user level (that is, in your application) with filters, because turbopuffer doesn't provide a built-in mechanism for row- or document-level RBAC.

Store the user ids and group ids that have read access directly on each document. At query time, fetch the user's id and groups from your auth layer and pass them as a filter. This approach is generally more performant than passing document ids in a filter.

An array can be up to 8 MiB in size, so the group and user ids stored on each document have to fit within this limit. We store filterable attributes in an inverted index structure that allows us to efficiently filter on tens of thousands of user ids without performance degradation; the sidebar widget shows representative p90 latency as the number of permission ids in the query grows.

To reduce the storage cost of keeping user and group permissions on each document, encode them as UUIDs. Note that the uuid type needs to be specified explicitly in the schema; otherwise the type will be inferred as a slower and more expensive string type.
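As a rough sketch of the advice above, the snippet below generates UUID identifiers and declares the permission attributes with an explicit array-of-UUID type. The `'[]uuid'` type string and the attribute names `user_ids` and `groups` are assumptions made for illustration; check the schema documentation for the exact spelling. The size comparison only illustrates why a UUID's 16 raw bytes are cheaper than its 36-character string form.

```python
import uuid

# Hypothetical permission identifiers; in practice these come from your auth layer.
alice = uuid.uuid4()
eng_group = uuid.uuid4()

# A UUID is 16 bytes in binary form but 36 characters in its canonical string form.
raw_size = len(alice.bytes)
text_size = len(str(alice))
print(raw_size, text_size)

# Schema fragment declaring the permission attributes explicitly.
# Assumption: '[]uuid' is the type string for an array of UUIDs.
permission_schema = {
    'user_ids': {'type': '[]uuid'},
    'groups': {'type': '[]uuid'},
}

# Document fragment carrying the permissions as UUID strings.
row = {
    'id': 1,
    'user_ids': [str(alice)],
    'groups': [str(eng_group)],
}
```

The `permission_schema` dictionary would be merged into the `schema` argument of the write call, alongside any other attribute definitions.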

import os
import turbopuffer


tpuf = turbopuffer.Turbopuffer(
    region='gcp-us-central1', # choose best region: https://turbopuffer.com/docs/regions
    api_key=os.getenv('TURBOPUFFER_API_KEY'),
)

ns = tpuf.namespace('permissions-example-py')

# write a few sample documents that are permissioned by group and user_ids

ns.write(
    upsert_rows=[
        {
            'id': 1,
            'vector': [1, 1],
            'content': 'changes in the leadership team',
            'groups': [],
            'user_ids': [123, 453, 125, 189]
        },
        {
            'id': 2,
            'vector': [2, 1],
            'content': 'simon & nikhil - 1:1 notes',
            'groups': [],
            'user_ids': [123, 125]
        },
        {
            'id': 3,
            'vector': [6, 1],
            'content': 'notes on planned Kubernetes migration',
            'groups': ['eng'],
            'user_ids': [96]
        }
    ],
    schema={
        'content': {
            'type': 'string',
            'full_text_search': True
        }
    },
    distance_metric='cosine_distance'
)

# now we can query the data passing in the appropriate permissions

result = ns.query(
    rank_by=('content', 'BM25', 'notes'),
    filters=('Or', (
        ('groups', 'Contains', 'design'),
        ('user_ids', 'Contains', 96))),
    limit=10,
    include_attributes=['content']
)
print(result.rows)

# [Row(id=3, vector=None, $dist=0.9686553, content='notes on planned Kubernetes migration')]
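
The query above hardcodes the caller's permissions. A minimal sketch of building the same kind of filter from values returned by an auth layer might look like the following. `build_permission_filter` is a hypothetical helper, not part of the turbopuffer client, and it uses only the `Or` and `Contains` operators shown above.

```python
from typing import Any, Sequence


def build_permission_filter(user_id: Any, group_ids: Sequence[Any]) -> tuple:
    """Build an 'Or' filter matching documents readable by the user or any of their groups.

    Hypothetical helper: assumes documents store readers in the
    'user_ids' and 'groups' array attributes, as in the example above.
    """
    # One clause for the user id, then one clause per group the user belongs to.
    clauses = [('user_ids', 'Contains', user_id)]
    clauses.extend(('groups', 'Contains', group_id) for group_id in group_ids)
    return ('Or', tuple(clauses))


# Example values standing in for what an auth layer would return.
permission_filter = build_permission_filter(96, ['design', 'eng'])
print(permission_filter)
# ('Or', (('user_ids', 'Contains', 96), ('groups', 'Contains', 'design'), ('groups', 'Contains', 'eng')))
```

The resulting tuple can be passed as the `filters` argument of `ns.query`. Building it in one place keeps the permission check from being accidentally omitted on individual queries.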