To improve search quality, multiple strategies can be used together. This is called hybrid search.
turbopuffer supports dense vector search and BM25 full-text lexical search. Combining them produces semantically relevant search results, as well as results matching specific words or strings (e.g. product SKUs, email addresses).
For both dense vector search and full-text search, turbopuffer supports passing additional filtering parameters to refine search results by matching against document attributes.
In this guide, we'll insert documents, search by both text and vector, and discuss ways of fusing the two result sets together.
We start by adding documents using the upsert endpoint, including the text content and associated vectors.
We also need to set the schema field to indicate which attributes should be indexed with BM25. The schema field only needs to be passed on the first upsert to a given namespace.
// Upsert documents
// POST https://api.turbopuffer.com/v1/vectors/namespace-name
{
  "ids": [1, 2, 3, 4],
  "vectors": [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]],
  "attributes": {
    "my-fav-number": [2, 4, 8, 16],
    "my-text": [
      "the quick brown fox jumps over the lazy dog",
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
      "hello world",
      "the pufferfish is my world"
    ]
  },
  "distance_metric": "euclidean_squared",
  "schema": {
    "my-text": {
      "type": "string",
      "bm25": true
    }
  }
}
// Response payload
{
  "status": "OK"
}
We offer various bm25 customization options. To learn more, see the schema docs.
Queries for the top-10 most similar documents to [0.5, 0.5], i.e. using a dense vector as the query. In this case, we've also specified a filter asking for documents whose my-fav-number attribute value is greater than 3.
// Request payload
{
  "vector": [0.5, 0.5],
  "distance_metric": "euclidean_squared",
  "top_k": 10,
  "filters": ["my-fav-number", "Gt", 3]
}
// Response payload
[
  {
    "dist": 0.0199,
    "id": 4
  },
  {
    "dist": 0.0799,
    "id": 3
  },
  {
    "dist": 0.1800,
    "id": 2
  }
]
Does a full-text search over the my-text attribute for the query "whose world is this?".
// Request payload
{
  "rank_by": ["my-text", "BM25", "whose world is this?"],
  "top_k": 10,
  "filters": ["my-fav-number", "Gt", 3]
}
// Response payload
[
  {
    "dist": 0.60278,
    "id": 3
  },
  {
    "dist": 0.53768,
    "id": 4
  }
]
At the moment, we don't provide any out-of-the-box mechanism to "fuse" the results of full-text and vector queries. As such, you cannot specify a vector query alongside a rank_by field in a single query; these must be done in separate requests. We'll likely support rank fusion in the future, though we haven't released this functionality yet, as we're waiting to better understand requirements and what customers are looking for.
In the meantime, the simplest way of fusing results is by using an algorithm known as reciprocal rank fusion. It combines multiple result sets by re-scoring documents according to their relative rank in each set.
def results_to_ranks(results):
    # Map each document id to its 1-based rank; assumes items expose an `id` attribute.
    return {item.id: rank for rank, item in enumerate(results, start=1)}

def reciprocal_rank_fusion(bm25, vector, k=60):
    # A document's fused score is the sum of 1 / (k + rank) over the result
    # sets it appears in; a missing rank contributes 0 (via float("inf")).
    bm25_ranks, vector_ranks = results_to_ranks(bm25), results_to_ranks(vector)
    scores = {
        doc_id: (1.0 / (k + bm25_ranks.get(doc_id, float("inf")))
                 + 1.0 / (k + vector_ranks.get(doc_id, float("inf"))))
        for doc_id in set(bm25_ranks) | set(vector_ranks)
    }
    # Highest fused score first.
    return [
        {"id": doc_id, "score": score}
        for doc_id, score in sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ]
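For instance, fusing the two example result sets from above (here we wrap the raw response entries in a small Result tuple, since results_to_ranks reads an id attribute):
from collections import namedtuple

Result = namedtuple("Result", ["id", "dist"])

# The example responses from the vector and full-text queries above.
vector_results = [Result(4, 0.0199), Result(3, 0.0799), Result(2, 0.1800)]
bm25_results = [Result(3, 0.60278), Result(4, 0.53768)]

fused = reciprocal_rank_fusion(bm25_results, vector_results)
# Documents 3 and 4 appear in both result sets (and tie on fused score here),
# so they rank ahead of document 2, which only appears in the vector results.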
Another idea could be to use an out-of-the-box reranker from one of the various third-party reranking providers, or to train your own.
As always, if you have questions about rank fusion or hybrid search in general, feel free to contact us.