In late 2022 I was helping my friends at Readwise scale their infrastructure ahead of the launch of Readwise Reader, their read-it-later app. We wanted to build a highly requested feature: article recommendations and semantic search using vector embeddings. Readwise was paying ~$5k/month for their relational database, but we found that vector search over the same 100M+ documents would've cost $20k+/month! This wasn't just expensive; it meant we had to shelve a desired feature until costs came down.
It gnawed at me for months. A promising feature, postponed solely due to infrastructure costs? Surely these exorbitant prices weren't rooted in some immutable law of physics. Had we simply failed to create a search engine that truly harnessed the power of modern hardware and services?
As I went deeper, painful memories of search outages from my early days on Shopify’s infrastructure team came flashing back. The existing solutions aren’t just expensive; they’re also incredibly difficult to operate at scale. It was clear to me that a lot had changed since the current generation of search engines was designed: object storage is now ubiquitous, NVMe SSDs have become incredibly fast and affordable, and AI and vector workloads mean we’re asking more from our search systems than ever before.
In 2022, production-grade vector databases were relying on in-memory storage at $2+ per GB, not counting the extra cost of durable disk storage. This is the most expensive way to store data. You can improve on it by moving to disk: triply replicated SSDs at 50% storage utilization will run you about $0.60 per GB. But we can do even better by leveraging object storage (like S3 or GCS) at around $0.02 per GB, with SSD caching at $0.10 per GB for frequently accessed data. That’s up to 100x cheaper than memory for cold storage, and 6-20x cheaper for warm storage!
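Those multiples fall straight out of the per-GB prices above; a quick back-of-the-envelope check:

```python
# Approximate 2022 storage prices from above, in $/GB/month.
ram            = 2.00  # in-memory storage
replicated_ssd = 0.60  # 3x replicated SSD at ~50% utilization
ssd_cache      = 0.10  # single-copy SSD used purely as a cache
object_storage = 0.02  # S3/GCS-class object storage

print(ram / object_storage)        # 100x: memory vs. cold data on object storage
print(ram / ssd_cache)             # 20x:  memory vs. warm data on an SSD cache
print(replicated_ssd / ssd_cache)  # 6x:   replicated SSD vs. an SSD cache
```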
That’s when I decided this was what I wanted to do: build a search engine as you would build it in 2023 (the year development started). One where cost maps better to value, so features that aren’t being built today will get built. When you make gas 20% cheaper, people drive 40% more. Coupled with the tailwinds of retrieval in AI, it seemed like the right time to start building this was yesterday.
Fast forward to today, and turbopuffer offers a new approach to search, combining cost efficiency with high performance. By leveraging object storage and smart caching, we've built a solution that scales effortlessly to billions of vectors and millions of tenants/namespaces. We’ve heard loud and clear from our customers that retrieval costs have been constraining their product ambitions. We want to make it possible for our customers to search every byte they have.
╔═turbopuffer══════════════════════════════╗
╔════════════╗ ║ ║░
║ ║░ ║ ┏━━━━━━━━━━━━━━━┓ ┏━━━━━━━━━━━━━━┓ ║░
║ client ║░───API──▶║ ┃ Memory/ ┃────▶┃ Object ┃ ║░
║ ║░ ║ ┃ SSD Cache ┃ ┃ Storage (S3) ┃ ║░
╚════════════╝░ ║ ┗━━━━━━━━━━━━━━━┛ ┗━━━━━━━━━━━━━━┛ ║░
░░░░░░░░░░░░░░ ║ ║░
╚══════════════════════════════════════════╝░
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Let's take a step back to see how search engines fit into the broader infrastructure stack and how their requirements differ from other types of databases.
Companies typically start with a relational database (Postgres/MySQL) and gradually extract parts of their workload into ~5 specialized databases for performance, cost, and scalability. Searching a few million documents in your relational database is unlikely to cause issues, and any concerns at that scale are outweighed by the overhead of operating another database. But with sufficient scale, you start feeling like a butcher with a Swiss Army knife. You'll know you've hit this point when the same workload keeps showing up in the database query profiles during incidents.
The five most common specialized databases in the modern infra stack are:
Category | Tech | Read Latency | Write Latency | Storage | Use-Cases |
---|---|---|---|---|---|
Caching | Redis, Memcached | <500µs | <500µs | Memory | Cost/performance |
Relational | MySQL, Postgres | <1ms | <1ms | Memory + Replicated SSDs | Source of truth, transactions, CRUD |
Search | Elasticsearch, Vector DBs | <100ms | <1s | Memory + Replicated SSDs | Recommendations, search, feeds, RAG |
Warehouse | BigQuery, Snowflake | >1s | >1s | Object Storage | Reports, data analysis |
Streaming | Kafka, Warpstream | <100ms | <100ms | Replicated HDDs or Object Storage | Logging, moving data between systems, real-time analytics |
Edit: a few of these numbers have been adjusted; they are meant to be directional, not exact.
The storage architecture of the current generation of search engines simply doesn’t map closely enough to the performance characteristics and cost constraints of search. We can do better by moving from the old world of replicated SSDs to object storage with SSD and memory caching.
First-Principles Storage Costs
RAM + 3x SSD* | ████████████████████████████████████████ $3600.00/TB/month (incumbents)
RAM Cache† + 3x SSD* | █████████████████ $1600.00/TB/month (incumbents, relational databases)
3x SSD* | ██████ $600.00/TB/month
S3 + SSD Cache† | █ $70.00/TB/month (turbopuffer)
S3 | $20.00/TB/month
*: 50% disk utilization
†: 50% in cache
1TB ≈ 120M documents with 1536-dimensional vectors and 2KB of text each
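Most of the bars above can be reproduced from the per-GB prices earlier; a directional sketch (the top bar additionally prices RAM at the upper end of the $2+/GB range):

```python
# Directional $/TB/month, derived from the per-GB prices earlier in the post.
TB = 1000  # GB

ssd_3x    = 0.60 * TB         # 3x replicated SSD at ~50% utilization -> $600
ram_cache = 0.50 * 2.00 * TB  # half the dataset in a $2/GB RAM cache -> $1000
ssd_cache = 0.50 * 0.10 * TB  # half the dataset in a $0.10/GB SSD cache -> $50
s3        = 0.02 * TB         # one full copy on object storage -> $20

print(ram_cache + ssd_3x)  # $1600/TB/month: RAM cache + 3x SSD (relational databases)
print(ssd_3x)              # $600/TB/month:  3x SSD only
print(s3 + ssd_cache)      # $70/TB/month:   object storage + SSD cache (turbopuffer)
print(s3)                  # $20/TB/month:   resting cold on object storage
```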
The current generation of search engines is built on the replicated-disk architecture typical of relational databases. This setup is excellent for low latency, extremely high concurrency for updates, and transactions. However, search engines don’t require all of this! Their write workload is more akin to a data warehouse’s: high write throughput, no transactions, and more relaxed latency requirements, especially for writes. Consequently, for search engines we end up paying a serious premium for storage performance characteristics we don’t need!
Unlike data warehouse queries, search queries do need to finish in <100ms rather than seconds. Because requests to object storage take more than 100ms, we still need an SSD/memory cache to reduce latency for actively searched data.
Occasionally, you might get a cold query served from object storage that takes a few hundred milliseconds, if a node dies or the dataset has been evicted from the cache. For most search use-cases, the savings are well worth the latency hit of an occasional cold query.
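In code, the read path is conceptually just a cache hierarchy in front of object storage. A minimal sketch, assuming a simple block-keyed layout (the function and path names below are hypothetical, not turbopuffer's internals):

```python
import os

# Illustrative two-level cache in front of object storage: memory ->
# local NVMe -> S3/GCS. The dict cache and on-disk layout are hypothetical.
memory_cache: dict[str, bytes] = {}
SSD_CACHE_DIR = "/mnt/nvme/cache"  # hypothetical cache directory

def get_from_object_storage(key: str) -> bytes:
    """Cold path: a GET against object storage, typically hundreds of ms."""
    raise NotImplementedError("stand-in for an object storage client")

def read_block(key: str) -> bytes:
    # 1. Hot: serve straight from memory (microseconds).
    if key in memory_cache:
        return memory_cache[key]

    # 2. Warm: serve from the local SSD cache (sub-millisecond).
    ssd_path = os.path.join(SSD_CACHE_DIR, key.replace("/", "_"))
    if os.path.exists(ssd_path):
        with open(ssd_path, "rb") as f:
            data = f.read()
        memory_cache[key] = data
        return data

    # 3. Cold: fall through to object storage, then populate both caches so
    #    the next query for this data is warm again.
    data = get_from_object_storage(key)
    os.makedirs(SSD_CACHE_DIR, exist_ok=True)
    with open(ssd_path, "wb") as f:
        f.write(data)
    memory_cache[key] = data
    return data
```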
This design allows freely walking the tradeoffs between cheap, cold, slow storage, and warm, expensive, fast storage. The architecture is just as fast for warm queries, but cheaper than alternatives that require more copies of the data for durability. For infrequently accessed data, turbopuffer’s design is more than an order of magnitude cheaper. The design especially shines when only subsets of the data (e.g. tenants) are active at once.
We set out to build a database that fully leverages this architecture. Not only is it highly cost-effective, but being backed by object storage offers unparalleled reliability and virtually unlimited scalability. In addition, if something is queried enough to make it all the way into the small memory cache, there is no reason why hot queries can’t be as fast as in any other architecture.
Justin (CTO) and I learned the hard way during our 5+ years on the last-resort pager at Shopify that the fewer stateful dependencies a system has, the more nines of uptime it achieves. This is why turbopuffer has no dependencies in the critical path other than object storage. From day one, turbopuffer has employed multi-tenancy and sharding, which we know from our days of protecting shops from each other at Shopify is paramount for reliability. This architecture is a major reason we’ve been able to maintain 99.99% uptime since launch.
To achieve this, we have built a storage engine native to object storage. This is not tiering, where cold data is eventually replicated to object storage: it is an object-storage-first LSM storage engine where object storage is the source of truth. Writes are durably committed to object storage. Existing storage engine/LSM implementations don't apply here because the rules are different on object storage: any node can compact data, latency is high, throughput is phenomenal, individual writes are expensive, and storage is cheap. Each search namespace is simply a prefix on object storage. If a node dies, another will load the namespace into cache after a cold query (~500ms). We don't sell "High Availability" (HA) at an extra cost, as any node can serve traffic for any namespace (though we do attempt to route traffic to consistent nodes for cache locality). If you want our HA number, I’ll just send you the current `kubectl get pods | wc -l`.
Additionally, because much of our compute runs on spot nodes, we constantly exercise this failure path.
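Concretely, "each namespace is a prefix and object storage is the source of truth" can be sketched as writing immutable segments under that prefix and rebuilding state by listing it. The bucket name, key layout, and JSON framing below are hypothetical; this is an illustration of the idea, not turbopuffer's actual format:

```python
import json
import time

import boto3

s3 = boto3.client("s3")
BUCKET = "search-data"  # hypothetical bucket name

def commit_write(namespace: str, documents: list[dict]) -> str:
    """Durably commit a batch of writes as a new immutable object under the
    namespace's prefix. Once the PUT succeeds the write is acknowledged;
    because object storage is the source of truth, no other replica is needed."""
    key = f"{namespace}/wal/{time.time_ns()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(documents).encode())
    return key

def list_segments(namespace: str) -> list[str]:
    """Any node -- including a fresh one after a node death or spot eviction --
    can reconstruct the namespace by listing its prefix and reading segments,
    LSM-style; background compaction merges small segments into larger ones."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{namespace}/")
    return [obj["Key"] for obj in resp.get("Contents", [])]
```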
To optimize cold latency, the storage engine carefully manages roundtrips to object storage. The query planner and storage engine have to work in concert to strike a delicate balance between downloading more data per roundtrip and doing multiple roundtrips (P90 to object storage is around 250ms for <1MB). For a vector search query, for example, we aim for a maximum of three roundtrips to keep cold latency under a second.
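To make that roundtrip budget concrete, here is a simplified sketch of an IVF-style cold query that stays within three roundtrips. The key layout and `fetch` helper are hypothetical, and document ids/attributes are simplified:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def fetch(key: str) -> bytes:
    """Stand-in for a GET against object storage (~250ms P90 for <1MB)."""
    raise NotImplementedError

def cold_vector_query(namespace: str, q: np.ndarray, top_k: int = 10):
    dim = q.shape[0]

    # Roundtrip 1: fetch the small index header, e.g. cluster centroids.
    centroids = np.frombuffer(fetch(f"{namespace}/centroids"), np.float32).reshape(-1, dim)
    nearest = np.argsort(np.linalg.norm(centroids - q, axis=1))[:4]

    # Roundtrip 2: fetch the chosen clusters concurrently; parallel GETs cost
    # roughly one roundtrip of latency, not one per cluster.
    with ThreadPoolExecutor() as pool:
        blocks = list(pool.map(lambda c: fetch(f"{namespace}/cluster/{c}"), nearest))
    candidates = np.concatenate(
        [np.frombuffer(b, np.float32).reshape(-1, dim) for b in blocks])
    top = np.argsort(np.linalg.norm(candidates - q, axis=1))[:top_k]

    # Roundtrip 3: fetch stored attributes for only the winning candidates
    # (in a real index each candidate would carry a document id).
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda i: fetch(f"{namespace}/attrs/{int(i)}"), top))
```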
We are continually working to reduce latency and have many tricks up our sleeve to improve both cold and warm latency over time. For most production search applications, the current performance is already excellent:
1M 768 dimensional vectors (3 GB)
Cold Query Latency | █████████████████████████████████████████ (512ms)
Warm Query Latency | ██ (37ms P90)
You can read more about our architecture, our roadmap, and current limitations.
Our first large customer was Cursor, the AI code editor. Each codebase is turned into a vector index to power various features; Cursor manages billions of vectors across millions of codebases. With their previous provider, they had to carefully bin-pack codebases onto nodes/indexes to manage cost and complexity. On top of that, the costs were astronomical, as every index was kept in memory despite only a subset of codebases being active at any point in time. Cursor's use-case was a perfect fit for turbopuffer’s architecture.
Cursor moved everything over in a few days in November of last year and immediately saw their costs drop 10x, with great cold and warm latency. Beyond their natural growth, it wasn’t long before Cursor started creating more vectors per user than before, as infrastructure cost and customer value started mapping far better. Cursor never stores plain-text code with turbopuffer, and goes even further by applying a unique vector transformation per codebase to make vec2text attacks extremely difficult.
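Cursor's exact transformation isn't public, but a per-codebase random rotation illustrates the idea: an orthogonal matrix derived from a per-tenant key preserves distances (so search results are unchanged) while scrambling the embedding space an attacker would need to invert with vec2text. A sketch with a hypothetical key scheme:

```python
import hashlib

import numpy as np

def rotation_for(codebase_key: str, dim: int) -> np.ndarray:
    """Derive a deterministic random orthogonal matrix from a per-codebase key.
    Illustrative only -- not Cursor's or turbopuffer's actual scheme."""
    seed = int.from_bytes(hashlib.sha256(codebase_key.encode()).digest()[:8], "big")
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((dim, dim)))
    return q

def transform(embedding: np.ndarray, codebase_key: str) -> np.ndarray:
    # Orthogonal transforms preserve dot products and Euclidean distances, so
    # nearest-neighbor search behaves identically on the rotated vectors.
    return rotation_for(codebase_key, embedding.shape[0]) @ embedding
```

In practice such a matrix would be derived once per codebase and applied to every embedding at both indexing and query time; because the rotation is orthogonal, recall is unaffected.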
turbopuffer also powers Suno’s radio feature, Dot’s memory (by New Computer), the memory of Shapes, and many more.
We are very much in production, but currently open by application only, as we focus on optimizing turbopuffer for our early customers. Let us know if you are a good fit. We hope you’ll trust us with your queries.