
Object storage-native database for search

March 09, 2026
CMU Seminar with Andy Pavlo

Transcript

Andy [0:26]:
Let's get started. We're excited today to have the CEO and co-founder of turbopuffer, Simon Eskildsen. Simon started turbopuffer three years ago. Prior to that, he was eight years at Shopify, where he was principal software engineer and director of engineering. So a lot of the things he'll talk about today are things he learned at Shopify. As always, if you have any questions for Simon as he's giving his talk, feel free to interrupt him at any time. That way, he's not talking by himself for an hour on Zoom. Simon, the floor is yours. Thank you so much for being here. I appreciate it.

Simon [1:00]:
Thank you, Andy. Thank you for having me. I just want to reiterate that if anyone has questions or anything during the talk, I very much encourage that. I don't want to talk just for an hour with no interruptions. But otherwise, I know Andy won't be shy to take any questions from the chat. So please, my name is Simon, and I will be talking about turbopuffer today, a database that we built on object storage that's currently primarily focused on search.

Simon [1:28]:
So turbopuffer is a database that's mainly optimized for search, while it can do a bunch of other things as well. I think the easiest way to show what it can do is just a simple full-text search query like this, where if you are searching for something, you can imagine that you might need to rank by various different signals. You have some hard filters on, say you want timestamps after a certain period, and you want to limit the results to something. You can do this with vectors; you can do arbitrarily different kinds of full-text search queries, like boosting results that have a more recent timestamp. All of these things are very useful for search. But you can also do more than just search in turbopuffer. You can do aggregations, you can group by different attributes, and you can order by any type of column, even multiple columns. So turbopuffer has lots of other features once your data is in it.
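As a concrete illustration, a query of that shape might look something like the payload below. The field names and structure here are hypothetical, for illustration only, and are not turbopuffer's actual JSON API.

```python
# Hypothetical query payload: full-text relevance ranking, plus hard
# filters, plus a result limit. All field names are invented for
# illustration; consult the real API docs for the actual shape.
query = {
    "rank_by": ("text", "BM25", "database performance"),  # rank by relevance
    "filters": (
        "And",
        [
            ("timestamp", "Gte", 1735689600),  # hard filter: recent docs only
            ("status", "Eq", "published"),
        ],
    ),
    "top_k": 10,  # limit the results
}
```

The same skeleton extends to the other query shapes mentioned: swap the ranking clause for a vector, or add a recency boost as another ranking signal.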

Simon [2:20]:
I think that in the limit, a database ends up more or less implementing every query plan on the data structures that are implemented inside of it, but you have to start with a very particular workload, and that workload for turbopuffer right now is search. The other thing about turbopuffer is obviously upserting your data into the database. You insert your data through turbopuffer; you don't just put it on object storage and have turbopuffer fetch it. You have to do it through turbopuffer itself. We provide a variety of different operations that you can do while writing.

Simon [3:00]:
For example, something that's very common when you're writing into a search engine is that you only want to write a record if the version that you're writing is newer than the version that's in the database, right? So you can imagine that you have a system, Shopify, for example, had a million of these systems, where every time someone changes a piece of data, they write it into the database, and they have bulk ingestions in the background; you want to make sure that only the most recent document survives. You can do a variety of other things, like patching attributes with a filter, right? This is like an UPDATE ... WHERE. Right now the JSON API to turbopuffer is fairly simple, and over time we expect that we'll probably ship a SQL interface to turbopuffer as the queries become more and more expressive.
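A minimal sketch of the write-if-newer semantics described here, against a plain dict standing in for the database. The function name and store shape are illustrative, not turbopuffer's API.

```python
def upsert_if_newer(store, doc_id, doc, version):
    """Write a document only if its version is newer than what's stored,
    so the most recent document survives regardless of arrival order.
    `store` is a plain dict standing in for the database."""
    existing = store.get(doc_id)
    if existing is not None and existing["version"] >= version:
        return False  # stale write (e.g. from a background backfill); drop it
    store[doc_id] = {"doc": doc, "version": version}
    return True
```

With this check, a bulk ingestion replaying old data can race live writes without clobbering newer documents.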

Simon [3:50]:
I wanted to also just frame a couple of the numbers around the scale that turbopuffer runs at, just to try to contextualize what kind of environment we're making decisions in and what we're scaling for. As far as I'm aware, we host the most vectors in the world, with almost 3 trillion, or actually more than 3 trillion, vectors and documents with full-text search in production. We run these at tens of thousands of queries per second, and we see peaks just shy of 100 GB per second. We also have more than 200 million namespaces. These namespaces you can think of as shards or tables, and we'll get into them in more depth later.

Simon [4:30]:
The largest search spaces that we see are essentially people ingesting all of Common Crawl into turbopuffer and searching it at once. This is often in the high tens of billions of documents that you're searching at once, which poses lots of challenges that we're going to talk about today. This talk is composed of three different parts. The first part, we'll talk about V1. V1 was the version of turbopuffer that I wrote at my cabin up here in Canada over the summer of 2023 and put into the market to see if anyone would care about this kind of database. We took that to its limit and eventually turned it into a real database with V2.

Simon [5:10]:
The other thing I want to connect this to is Andy's theme of basically justifying turbopuffer's existence against being a Postgres extension or merely just Postgres. Why build another database, and what kinds of conditions have to arise for a new database to be both commercially successful and useful to companies around the world who are building software on it? So first, we will talk about V1. To properly, I think, understand the database, you have to understand the DNA and the origin of the people who wrote the database and what kinds of scars or experience, whatever you want to call it, they go in with as they're designing the system.

Simon [5:50]:
My background is not in database research. It's not even in working on databases. I didn't even go to university. My experience is fully from operating these databases and building layers on them for scale. I worked at Shopify between 2013 and 2021 while we were scaling from a thousand requests per second to more than a million. The hardest part of these apps to scale ends up being the databases. That's what causes the most painful outages on Black Friday, and that's what we're always preparing for. Through my time there, I worked primarily with the following databases: MySQL, Redis, Elastic, Memcached, Kafka, Zookeeper, and many others, and learned to combine them in all kinds of creative ways.

Simon [6:30]:
I think that being on the last resort pager, where I've been paged by every single one of these databases many, many times, just irrevocably has changed the way that I write software and also the way Justine Li, my co-founder, writes software. I have debugged every single one of these databases for hours on end, completely sleep deprived. I have been woken up by some of these databases every single day in the days leading up to Black Friday. It's taught me pretty much everything that I know about how to design systems that do not wake you up because that's what you end up optimizing for. So for better or worse, this is one of the most important things that I design software now with in mind.

Simon [7:10]:
The other thing that I learned is that it's very uncommon to have this long of a tenure at a tech company these days. Eight years working on a single piece of software and scaling it teaches you a lot about how software ages well or what doesn't age well. I think my biggest takeaway from this, which is quite congruent with being on the last-resort pager, is that it continues to shock me how the things that you put in place three days before Black Friday, with spit and bubble gum, as my boss used to call it, tend to be some of the things that last the best.

Simon [7:50]:
I keep saying about my co-founder Justine that she is probably the best person I've ever seen at coming up with one-line fixes to 100,000-line solutions. It has continued to surprise me again and again how something very, very simple, when you think hard enough about the problem, can age very well compared to the RFC that's been debated among ten principal engineers and gone through revisions by every team before it makes it into production. Every single one of these databases has heavily influenced the way that we think about designing turbopuffer.

Andy [8:20]:
I mean, I don't want to like, you know, sit and rag on other systems, and that obviously depends on many different factors of where you deploy them, but which of these six was the worst?

Simon [8:32]:
The one that I now compete with.

Andy [8:36]:
Okay. Then we're going to go with scars. Okay. Sure. Yes. Fair.

Simon [8:40]:
The other major thing that goes into designing turbopuffer is this project I run called Napkin Math. Napkin Math is essentially a collection of different numbers for how a machine is supposed to perform. There are more numbers than the ones here; these are just some of the most applicable to turbopuffer. Napkin Math is basically just a collection of, okay, what kind of bandwidth can I expect on an SSD read, right? Not what the spec sheet on the drive says, but what reasonably written code would get. What can I expect from object storage? This number I didn't get to update in time, but you can expect about 100 MB per second from a single connection, though of course you can exhaust the NIC if you use as many connections as possible.

Simon [9:30]:
The p99 is around 50 milliseconds, and that doesn't vary very much between a block size of 128k and 1MB. All of these numbers are maintained in this repository with the first Rust that I ever wrote. The reason that I worked on this was because in my role as a principal engineer at Shopify, what I spent a lot of time on was reviewing how people were going to use some of the databases we just talked about before for developing some feature. Often they would show up with a profile of, okay, I did the workload in this way on this database, and then they would make decisions based on these benchmarks.

Simon [10:00]:
Often those benchmarks were wrong, and what I found myself doing again and again was just saying, well, look, it has to visit the B-tree this many times; there's no way it's going to take more than this amount of time. Then sure enough, you look at the query plan, and the database wasn't doing what it was supposed to do. So I collected all these numbers as a way to basically construct a model, a very simple model of the system, and then construct a very simple hypothesis for how it should perform. Then you see the difference between how the system is actually performing on a query and what the math is telling you, and if that gap is off by an order of magnitude, then you know that either there's a massive opportunity to improve the system or there's a bug in your understanding and your model of the system.
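That style of estimate can be made concrete with a small sketch. The latency figure and fanout below are illustrative napkin numbers, not measurements from any real system.

```python
RANDOM_READ_US = 100  # napkin figure for one uncached random read, microseconds

def btree_lookup_floor_us(num_rows, fanout=100, cached_levels=2):
    """Napkin-math floor for a B-tree point lookup: the number of uncached
    levels the query must visit, times one random read each. If a measured
    query is an order of magnitude above this, either the query plan is bad
    or the model of the system is wrong."""
    levels, capacity = 1, fanout
    while capacity < num_rows:  # how many levels a tree of this fanout needs
        levels += 1
        capacity *= fanout
    return max(levels - cached_levels, 0) * RANDOM_READ_US
```

For 100 million rows at fanout 100 with the top two levels cached, the floor is two random reads; anything wildly above that deserves a look at the query plan.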

Simon [10:50]:
This always beats profiling. Profiling will tell you how to optimize the system 10%, 5% here or there, but it will not tell you what the absolute floor of latency or performance is. The other thing that was in the air was building a database with a different kind of design. Back to my scars of being on call for a very long period of time, it seemed clear to me that the only database that I would go on call for was one where I didn't have to operate the storage layer, where the storage layer was entirely reliant on S3, including even the consensus layer.

Simon [11:20]:
When I think about an object storage native database, the way that we've decided in turbopuffer for maximum operational simplicity and for reliability, an object storage native database is one where object storage is the only stateful dependency. There's no east-west coordination. There is no consensus layer in a separate system. Everything about the system is in simple files on object storage. The other condition for an object storage native design that I sketched out in 2023 was that the storage engine and the coordinator planner should be object storage aware. It should know that the round trip p99 to object storage is around 100 milliseconds. It should know that those round trips are costly, but concurrency to the system is heavily encouraged, right? If you can minimize the number of round trips but do a lot of outstanding concurrency, that's really good for object storage.

Simon [12:10]:
So the query planner and the storage engine should be deeply object storage aware in a way that's not possible to easily bolt onto another system. The third condition for an object storage native database is one where coordination is done with compare and swap directly on object storage. Modern object storage today has these operations where you can essentially fetch an object, you get a tag back, and then you only write the object back to object storage if the tag is matching. This allows you to drive consensus with object storage at the cost, obviously, of latency. But if you carefully design the system, you can still get very good performance with this primitive alone.
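The compare-and-swap loop described here can be sketched against a toy in-memory stand-in for object storage. The class and method names are invented for illustration; real SDKs expose this as preconditions (for example, an `If-Match` header on a put).

```python
import uuid

class ToyObjectStore:
    """In-memory stand-in for object storage with compare-and-swap:
    every object carries a tag, and a write only succeeds if the caller
    presents the tag it last read."""
    def __init__(self):
        self._objects = {}  # key -> (tag, data)

    def get(self, key):
        return self._objects.get(key, (None, None))  # (tag, data)

    def put_if_match(self, key, data, expected_tag):
        current_tag, _ = self._objects.get(key, (None, None))
        if current_tag != expected_tag:
            return None  # someone else wrote first; caller must retry
        new_tag = uuid.uuid4().hex
        self._objects[key] = (new_tag, data)
        return new_tag

def cas_append(store, key, entry):
    """Optimistic append: read, modify, conditional write, retry on conflict."""
    while True:
        tag, manifest = store.get(key)
        if store.put_if_match(key, (manifest or []) + [entry], tag) is not None:
            return
```

The retry loop is the consensus: whoever wins the conditional put defines the next state, and everyone else re-reads and tries again, at the cost of a round trip per attempt.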

Simon [12:50]:
This kind of database is one that you've only been able to build since essentially December of 2024, when S3 finally released compare-and-swap. Google Cloud Storage had this available beforehand, which was actually not why I started there; it was just the one I was familiar with at the time I started turbopuffer, because Shopify ran on GCP. NVMe SSDs were also not available in the clouds until 2017, and most databases today still aren't running on them. They're all running on these network disks that are not particularly fast. S3 also didn't become strongly consistent until 2020, meaning that before then, if you wrote an object, you weren't guaranteed to get the same thing back on the next read, which meant that all of the databases built on object storage before December 2020 had a fat metadata plane to coordinate all of this.

Simon [13:40]:
These conditions allow us to create a database that is completely different from the ones that came before it. It has tradeoffs, and sometimes when I'm looking at a database website, it's difficult to figure out whether they're selling a sneaker or a database. To me, what should be very clear is that every database makes very, very different tradeoffs, right, in terms of how they store the data and the storage architecture that they've chosen. The tradeoffs of turbopuffer's storage architecture are that it's very low cost. Everything goes to object storage, $0.02 per gigabyte to store. It's very simple. You could blow away every single VM in turbopuffer's accounts, and we would not lose data, because we will never acknowledge a write to the writer until it's been acknowledged by S3.

Simon [14:30]:
Simple horizontal scalability, because all of this is delegated to the hundreds, if not thousand-plus, engineers that work on S3 or Google Cloud Storage. Warm queries: when the data is in cache in DRAM and on the NVMe SSDs, there's no reason it can't be as fast as any other system. Write throughput can be phenomenally large, right? Turbopuffer just peaked close to 100 GB per second. The weaknesses are very clear too, right? You have slow cold queries when you go directly to object storage. The p99 of reading a block from object storage is around 100 milliseconds, and it can spike higher than that if you have a lot of objects on a node, and there's a lot of opaque caching at play. So a cold query to turbopuffer can easily be 500 to 1,000 milliseconds.

Simon [15:10]:
You have high write latency. The p99 of a put to object storage is around 100 milliseconds for a small write, and obviously this can get larger as the writes get larger. And so this prohibits certain types of workloads, right? You could never just hot swap a relational database, even if turbopuffer supported all of SQL. Something like Shopify would just never work on a database like that, because the write latency would be too high for doing things like inventory reservations. And then the last thing is that because we have to drive consensus with object storage, it means that to get consistent queries, we have to round trip with a conditional GET (If-None-Match) to make sure that the caching node has the latest write, which takes about 8 milliseconds on S3 and about 15 milliseconds on Google Cloud Storage.

Simon [15:50]:
You can mitigate these weaknesses in the limit, right? You can pin things in cache, and you could implement another type of east-west coordination or something like that to try to drive consensus. But in the object store native design that turbopuffer is using, these are the fundamental weaknesses. But these are very good trade-offs for search. So to go back to starting to write V1...

Simon [16:40]:
So...

Simon [16:40]:
please.

Simon [16:41]:
we have a question from the audience. Do you want to unmute yourself? Go for it.

Audience Member [16:47]:
Well, I just had a quick question. So there were strengths and weaknesses. I saw a high write throughput in the strength, but there was a high write latency in the weakness. So I was just curious, like these two seem to collide, and I was just wondering why.

Simon [17:04]:
Yeah, so you can basically max out the network, right? Modern VMs, the ones that turbopuffer runs on, have maybe 50 gigabits per second of networking. So we can max that out and write it into object storage. But even if you did a small write of just a few kilobytes, we have to round trip to S3, and we have to wait 100 milliseconds. So it's sort of a classic throughput-latency tradeoff, where if you're writing 1 MB or 100 MB, we might be able to do that in the same amount of time, even though in another database those would have very, very different characteristics.

Audience Member [17:44]:
Thank you.
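The arithmetic behind that answer can be sketched quickly, assuming the roughly 100 ms round trip and 50 Gbps NIC mentioned above. These are napkin figures, not measurements.

```python
RTT_MS = 100   # rough p99 round trip for a put to object storage
NIC_GBPS = 50  # networking on the kind of VM mentioned above

def write_time_ms(payload_bytes):
    """Rough wall-clock time for one write: the fixed round trip dominates
    until the payload is large enough to occupy the NIC for a while."""
    transfer_ms = payload_bytes * 8 / (NIC_GBPS * 1e9) * 1e3
    return RTT_MS + transfer_ms
```

By this model, a few kilobytes and 100 MB both land in the same ballpark (roughly 100 ms versus roughly 116 ms), which is exactly the high-latency, high-throughput profile described.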

Simon [17:49]:
So the problem that turbopuffer initially was trying to solve, and is still solving, is reducing the cost of search dramatically. I knew this problem in the back of my head from my time operating the Lucene-based clusters at Shopify and the operational woes of doing so. But when I left Shopify in 2021, I was helping a bunch of my friends' companies with their database struggles, which in 2021, by the way, mostly boiled down to tuning autovacuum in Postgres. One of these problems was with my friends at this company called Readwise. They index a bunch of articles and wanted to build search, and in particular vector search, over all of it. I ran the back-of-the-envelope math on it, and it would have cost 30 grand a month to index all the data. This is a bootstrapped company that was spending five grand a month on all other infrastructure combined.

Simon [18:40]:
So just doing search for 30k a month was just not tractable. It was completely bonkers to me that it would even cost that much to put all these vectors in. And it occurred to me that it was because the incumbent solutions were all doing all of this in memory, even though it seemed particularly well-suited to a completely different architecture. These were kind of the things that I put together to start the first version of turbopuffer. If we look at the math on the cheapest way to store versus the most expensive and most performant way to store: the classic way to run a database is that you replicate it among three SSDs, and then you have up to 100% of the data hot in RAM. If you run that on a modern cloud, you might be paying north of $2 to $3 per gigabyte.

Simon [19:30]:
This is two orders of magnitude more than having all of that in S3. Now, of course, S3 has completely different performance characteristics, right? High throughput, high latency. But it felt to me that you would be able to create a database that would be able to puff up between these layers of the memory hierarchy, okay, Andy's nodding; he's getting the name now, maybe even turbopuff up into the NVMe SSD and into memory. I think one of the fundamental trade-offs that a lot of database systems make, right, is that if you think about the hierarchy between object storage, the SSD, and memory, you want to try to keep as little as possible in memory and on disk, right, and as much as possible in object storage.
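A back-of-envelope version of that comparison, using the rough figures quoted above (real cloud prices vary):

```python
# Rough $/GB-month figures from the talk, not current cloud list prices.
TRIPLE_SSD_PLUS_RAM = 2.50  # ~$2-3/GB: 3x replicated SSD with hot data in RAM
OBJECT_STORAGE = 0.02       # ~$0.02/GB on S3-class storage

def monthly_cost(gb, per_gb):
    return gb * per_gb

# ~125x, i.e. the "two orders of magnitude" gap between the two designs.
ratio = TRIPLE_SSD_PLUS_RAM / OBJECT_STORAGE
```

A terabyte that costs around $2,500 a month fully replicated and hot costs around $20 a month sitting in object storage, which is why where data lives in the hierarchy matters so much.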

Simon [20:20]:
And so the cache efficiency of the system is very important. But over time, you know, in the limit, if you're doing 10,000 QPS to a workload, the economics of DRAM make sense. If you're doing a query every hour, it probably makes sense to have it in object storage. So if you can build a system that turbopuffs into the memory hierarchy at the right time, this could be very, very useful for search. Think about one of our customers like Cursor or Notion. When you open a Cursor workspace, right, it doesn't need to be in memory until you're starting to query that code base a lot, right, with all the embeddings for it. Same with Notion, right? When your search hits turbopuffer, we can know ahead of time to start warming the caches as you type the query.

Simon [21:10]:
So there's lots of tricks here that we can play to warm it into the memory hierarchy, and if something is busy, well then we can have it in the memory hierarchy at all times. The first version of turbopuffer was very, very simple. It ran on an 8-core commodity instance in GCP, and this is simple, but it's not irresponsible, right? Every single write was committed to S3, and this object storage native model allows you to create really simple but still very reliable databases. Turbopuffer at the time was really novel in charging about a dollar per month per million vectors. Some of the other incumbents were charging close to a hundred dollars for storing something like that, and so that was what really brought this onto the map.

Simon [22:00]:
I should also mention that really this was more of a summer project. I did not intend for this to become a big company. It was really just a, I want to see if I can do this because it seems fundamentally possible, and the idea just completely consumed me. As you can see here too, I don't say that it's the fastest. A lot of databases are like, "Oh, we're so fast." It's like, "No, it's reasonably fast." But the cost unlocked new use cases that I'd known right from working with Readwise. So now we'll start to get into the meat of the technical implementation of V1. Some ideas survived, and some did not.

Simon [22:40]:
One of the things that survived is the fundamental architecture, right? I mean, pretty soon after it was launched and there was interest, we ported this over to something that would run on multiple nodes, but this architecture is largely the same today. We've evolved it from here, but this is the current form, and it served incredibly well. If we follow a read through this, you can imagine it first hitting the load balancer. The load balancer will take the table or the namespace, as we call it, that you're querying and then do a consistent hash on it, right? Proverbially a consistent hash. You can imagine that this is arbitrarily complicated to try to get you to the node that is most likely to have that namespace in cache.
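The "proverbial consistent hash" can be sketched as a minimal hash ring. This is the textbook version, a deliberate simplification of the arbitrarily complicated routing described above; node names are invented.

```python
import bisect
import hashlib

class Ring:
    """Minimal consistent-hash ring: each node gets many virtual points on
    the ring, and a namespace routes to the first point at or after its own
    hash, so the same namespace keeps landing on the same (likely warm) node."""
    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (self._h(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _h(s):
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def node_for(self, namespace):
        i = bisect.bisect(self._keys, self._h(namespace)) % len(self._ring)
        return self._ring[i][1]
```

The useful property is that adding or removing a node only moves the namespaces adjacent to its points on the ring; everything else keeps its cache affinity.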

Simon [23:20]:
So it will go there, and then it will start reading the metadata and fetching all the different blocks that are required to service that query, going through the memory cache, then the NVMe cache, and finally going to object storage for the files, or the subsets of the files, that it needs to service that query. So if you do a cold query on a million vectors, or a few million vectors, it might take about 500 milliseconds to do the three to four round trips to object storage. If it's in cache, it will take about 10 milliseconds, and the majority of that time is spent waiting for S3 to round trip to make sure that we have the latest data. If it's on disk, it might take a little bit longer, right, like maybe 20 milliseconds or something like that. But just to give you an idea of the profiles through the cache here.

Simon [24:10]:
For a write, the path is more or less the same. It goes to the same node that would be doing the queries, and we write it directly to the write-ahead log on S3. So you could imagine sort of like 1.bin, 2.bin, 3.bin; we just continue to append into that directory. When it's been committed to S3, we'll send back a success to the client, and then we will write it into the cache on that node if we think it's prudent to do so. If there's a lot of reads and writes going on at the same time, it's a good idea to go straight into the WAL with any new writes.

Simon [24:50]:
When the WAL is far enough ahead of where we've last built an index, like the vector index, full-text search index, attribute indexes, and other secondary indexes that we might build, it will put it into the queue. The queue also exists on object storage, and the indexing nodes, sometimes called the compactor nodes in other designs, will work off of that queue and do things like compaction and building indexes. And similar to the query nodes, there needs to be some affinity. The thing that I love the most about this design is that we can just move namespaces around by changing where they land in the load balancer.

Simon [25:30]:
At Shopify, if we wanted to change where a shop lived, right, we had to move all of the data between MySQL shards. But with this design, we can just take a cache miss or start warming another node. The other thing that's really nice about this design is that if you want replicas, you really just round-robin to multiple nodes, and they start picking it up from cache. So if we have too many reads, we can just start spreading out the load, and obviously it also makes deploys really, really nice, because we can just roll through the whole system at some pace and manage the cache hit rate.

Simon [26:10]:
The write-ahead log has also survived from V1. It's really about as simple as you might imagine. We don't actually use 1.bin and 2.bin and 3.bin; these are UUIDs committed into a manifest, but for simplification, this is how the WAL progresses over time. Most databases do this, at least MySQL does, where you accumulate writes into a buffer and then sync it periodically. So in something like MySQL, on a disk, right, if you have a bunch of 100-byte writes, you're not fsyncing every single time you get a hundred bytes; you try to accumulate some of them and fsync a larger buffer.

Simon [27:00]:
It's really the exact same design on object storage, just scaled up to latencies in the hundreds of milliseconds instead of hundreds of microseconds. So that's what we do, right? When you're doing a lot of individual writes, we'll coalesce them into a buffer, and then we'll commit them to object storage. All the different object storage providers have very different characteristics for how you can commit to the WAL, but it will just continue to progress like this, where we commit a file and then commit it into a manifest with compare-and-swap. When you do a read, we will do a consistent read on the manifest, which has an inventory of the WAL.

Simon [27:40]:
That we can do with a conditional GET (If-None-Match), which only goes to the metadata layer, which has better performance than actually loading the file, which has to go through the storage layer of S3. These latency profiles differ across the object storage implementations: S3 is about 8 milliseconds, GCS is about 15 to 18 milliseconds, and Azure is actually 5 milliseconds, last I heard. So we get that, and then we replay the tail of the WAL on top of what we got back from the index to make sure that the query is consistent.

Simon [28:10]:
This is a bit of an unusual design choice for most search engines that just have eventual consistency. But we've worked around that before, like when working with Elastic and other solutions like that, working around eventual consistency is such a pain that we did not have high conviction in giving that up. You can give that up, and then your queries will be faster. But by default, turbopuffer is strongly consistent and will replay the WAL entirely on top of the index read.
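The write coalescing described a couple of paragraphs up can be sketched as a small buffer, with a callback standing in for the put to object storage plus the manifest compare-and-swap. Names and the size threshold are illustrative.

```python
class WalBuffer:
    """Coalesce many small writes into one object-storage commit. `commit`
    stands in for the real put-then-CAS-into-manifest step; in practice a
    timer would also force a flush so write latency stays bounded."""
    def __init__(self, commit, max_bytes=4 * 1024 * 1024):
        self._commit = commit
        self._max_bytes = max_bytes
        self._pending = []
        self._size = 0

    def write(self, record: bytes):
        self._pending.append(record)
        self._size += len(record)
        if self._size >= self._max_bytes:
            self.flush()

    def flush(self):
        if self._pending:
            self._commit(b"".join(self._pending))  # one round trip, many records
            self._pending, self._size = [], 0
```

This is the fsync-batching idea from the MySQL analogy, just with the sync replaced by a round trip measured in hundreds of milliseconds.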

Simon [28:50]:
Another design characteristic of V1 was the idea of essentially giving access to unlimited shards. When Justine and I were designing different systems at Shopify, we were always abusing the shop ID key as much as we possibly could. I've always wanted a database where that was available as a low-level primitive, where every single piece of data that is logically divided from everything else could be completely separated. Similarly, the primary key matters. One of the things I love the most about MySQL over Postgres is that it's very easy to organize all the data by the tenant ID, which in Postgres you'd have to rewrite the entire table, or the entire heap, to do.

Simon [29:30]:
So turbopuffer exposes this as a first-level primitive. When you work with a namespace or what you might call a table or a shard, it has a particular prefix on object storage. Now, this is not actually what it looks like, right? You have to randomize some prefixes and things like that to get it to be performant. But this is the simplest sort of lie to children metaphor for what we do. We have a bunch of metadata files, the WAL is growing, the LSM has a bunch of SSTables, and then you have a large metadata file that sort of spells out what keys are where, right, in a very, very simplified LSM 101 type of design.

Simon [30:10]:
We have hundreds of millions of these namespaces in production, and maintaining all this metadata is certainly a challenge, but it allows for really good cache efficiency. With use cases like Notion and Cursor, where all the data is so logically separated, your bin-packing problem becomes a lot nicer and more tenable.

Audience Member [30:03]:
What is the metadata challenge? Is it because S3 just chokes on it because you have so much crap in it? Or is it your side maintaining the in-memory representation of what's out in the namespaces?

Simon [30:14]:
So once in a while, like once a day or whatever, you're going to have to loop through all of these things to do billing or different reconciliation. The list call to get a thousand entries takes somewhere between 500 milliseconds and a second, so just paginating through the whole thing takes a very long time. I have a bonus slide, if we have time, of what I call the object storage bag of tricks. You still have to play a bunch of tricks to manage all of this entirely on object storage. These are the kinds of challenges that we have, right? And then in the billing layer, it's very high cardinality as well, yes.

Audience Member [30:51]:
So if Amazon magically made the metadata lookup be sub-millisecond, your life is easier.

Simon [31:00]:
I would say so. It would be a lot easier. There would be a lot of things that we would not have to design around. The list call is not one that I recommend ever using in production on a critical data path.

Audience Member [31:10]:
Okay, awesome. Thanks.

Simon [31:14]:
The other thing that's very much survived from V1 and continues to be important in turbopuffer is to have very fast cold queries and to use object storage native query plans. The general insight here is that you want to minimize the number of round trips to object storage as much as possible, right? As was asked earlier, you can drive a very large amount of throughput. You can saturate the NIC, but you're not going to be able to do it in less than a few hundred milliseconds of latency. So you better make good use of it.

Simon [32:00]:
Now fortunately, this is also how NVMe systems work: if you do a very large amount of outstanding concurrency and few round trips, they will perform incredibly well. But if you have a lot of dependencies in your I/O chain, they're not going to perform as well. It just seems to be the trend everywhere, also in CPUs, right? If you can do a lot of dumb, simple operations with very, very few dependencies between them, the machines rip, and that's been our experience with object storage, NVMe, and also the CPU.

Simon [32:40]:
So everything is designed around this, and we design assuming that cold queries are quite common. You can imagine that in the first round trip you get some metadata, like an lsm.json and an index.json. These might not actually be JSON files; that's just for illustrative purposes. Then you get the upper layers of some tree, then the lower layers, and then, if you couldn't speculate earlier on, maybe you have a select star or whatever and fetch a bunch of data at the end.

Simon [33:10]:
The second thing that you need for fast queries to object storage: you're only going to get about 100 MB per second from every connection that you open. So if you have a large SSTable and you want to download it fast, you have to split it into a bunch of different ranges and then parallelize all of those requests. Conversely, if you are only fetching a little bit of data, you might as well fetch the entire block and even adjacent blocks, because fetching a megabyte of data has a p99 of, I think, around 80 milliseconds, whereas fetching much less has a p99 of maybe 60 milliseconds.
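The split-and-parallelize idea Simon describes can be sketched like this; a minimal illustration in Python, where `fetch_range` and `parallel_get` are hypothetical names standing in for ranged S3 GETs:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_range(obj: bytes, start: int, end: int) -> bytes:
    # Stand-in for an S3 ranged GET ("Range: bytes=start-end"); in a real
    # system each call would run over its own connection, since a single
    # connection tops out around 100 MB/s.
    return obj[start:end]

def parallel_get(obj: bytes, chunk: int = 8 * 1024 * 1024) -> bytes:
    # Split the object into fixed-size ranges and fetch them concurrently,
    # then stitch the parts back together in order.
    size = len(obj)
    ranges = [(i, min(i + chunk, size)) for i in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=16) as pool:
        parts = pool.map(lambda r: fetch_range(obj, *r), ranges)
    return b"".join(parts)
```

The chunk size and worker count are made up for the sketch; in practice they would be tuned against the per-connection throughput ceiling.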

Simon [33:50]:
So it can be advantageous to fetch more data at a time and use large block sizes; the p99 just does not vary very much between getting 16 kB and a megabyte. It's also important for all of this to be zero copy. A lot of tiered databases will download something from S3, take their jolly time decompressing it and hydrating it into memory, and only then serve queries. Not all of them can do a direct range read and start operating on that data immediately, because they weren't designed for it from day one.

Audience Member [34:30]:
Are you using io_uring to avoid zero copy from the NIC?

Simon [34:34]:
We haven't even needed to use io_uring yet. We're just using direct I/O. It has not become a bottleneck, which has surprised me.

Audience Member [34:41]:
But direct I/O will be for the disk, but if you have to pull from the network, right, it's got to go through the kernel.

Simon [34:46]:
Yeah, we're doing the copy right now. It hasn't been a bottleneck yet.

Audience Member [34:49]:
Oh, okay, okay. And then how large are your block sizes?

Simon [34:53]:
The block sizes, I believe right now they're around 128k.

Audience Member [34:57]:
Okay.

Simon [34:57]:
That balances between the NVMe SSD and S3. You could imagine a system with different block sizes, but I think that's what it is; it might have changed.

Audience Member [35:06]:
Right. It's kilobytes, not megabytes.

Simon [35:09]:
Yeah, exactly.

Audience Member [35:10]:
Got it. All right, thanks.

Simon [35:11]:
On the right here, you can see that cold performance is still sub-second as we go directly to object storage, and warm namespaces are just as fast as any other system. If you look at our production traces, you'll see maybe a millisecond or less actually servicing the query, and 10 milliseconds going to S3 to make sure we have the latest entry from the WAL. You can turn that off in exchange for other characteristics, but we think this is a really good default for search systems.

Simon [36:00]:
The reason fast cold queries are so important is that if your cold queries are, say, 10 seconds, you end up patching around that everywhere in the system: you pin more stuff into the cache, you run larger caches, and that causes poor economics that get passed on to our customers. So fast cold queries are very important for the whole system not to be fragile. The other thing about V1: in 2023, when I was writing the first version of turbopuffer, HNSW and other graph indexes for vectors were all the rage.

Simon [36:40]:
The problem with them, to me, is that they have a lot of dependencies; they're very difficult to speculate on, which connects to the earlier point about modern hardware. The simplest way to do a vector search is to take a query vector, compare it against every single vector in the dataset, and return the 10 closest by the angle between them. The most popular algorithms at the time instead construct a nearest neighbor graph. You can imagine these vectors in two dimensions, as they are here, and you connect the vectors that are close to each other in a graph.

Simon [37:20]:
Then you drop yourself at some point in the graph and do a greedy search along it to find the closest vectors to your query vector. The problem with a graph is that every time you navigate an edge, you can't really predict where you're going next, so you can't speculate and page all of this in from disk ahead of time. You end up with all of these dependencies. And on an object store, that's 100 milliseconds, then another 100 milliseconds, then another, every time you walk the graph. Even on disk, a random read is in the hundreds of microseconds, which is still comparatively quite slow.

Simon [38:00]:
A clustered index is the more traditional approach: you take the vectors and divide them into clusters, trying to find some underlying pattern, and then you average the vectors in each cluster to get what's called a centroid. When you're doing a search, you compare your query vector against the cluster centroids, fetch only the few clusters closest to your query, and then exhaustively scan through those to get your top results. This is two round trips: you can imagine a cluster1.bin and a cluster2.bin, and you just download the appropriate data.
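The two-round-trip plan Simon describes can be sketched roughly as follows; a toy pure-Python illustration with hypothetical names (`centroid`, `clustered_search`), not turbopuffer's implementation:

```python
import math

def dist(a, b):
    # Euclidean distance; a real system might use cosine or dot product.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    # Average of the vectors in the cluster.
    n = len(cluster)
    return [sum(v[d] for v in cluster) / n for d in range(len(cluster[0]))]

def clustered_search(clusters, query, nprobe=1, k=1):
    # Round trip 1: compare the query against the centroids only.
    cents = [centroid(c) for c in clusters]
    nearest = sorted(range(len(clusters)),
                     key=lambda i: dist(cents[i], query))[:nprobe]
    # Round trip 2: download just those clusters (the cluster<i>.bin files)
    # and exhaustively scan them for the top-k.
    candidates = [v for i in nearest for v in clusters[i]]
    return sorted(candidates, key=lambda v: dist(v, query))[:k]
```

`nprobe` trades recall for I/O: probing more clusters fetches more data but is less likely to miss a true neighbor near a cluster boundary.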

Simon [38:40]:
That is not very far off from how it worked in the first version of turbopuffer. It's fast on disk, it's fast on object storage, and it's very, very simple to implement. Caching has also not changed very much from the very first version of turbopuffer. The NVMe SSD is a simple cache tier: we use a FIFO ring buffer where you just continue to write, write, write, and at some point you wrap around and start eating the tail. You can layer a variety of interesting techniques on top of this to make sure that values that should not be evicted from the disk cache are not.
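The FIFO ring buffer cache can be sketched like this; a minimal in-memory stand-in for an NVMe arena, with a hypothetical `RingCache` class, assuming entries are simply invalidated when the write cursor wraps over them:

```python
class RingCache:
    """FIFO cache over a fixed-size byte arena: append at the write cursor,
    and when the cursor wraps around, entries at the tail are evicted by
    being overwritten."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buf = bytearray(capacity)
        self.head = 0    # next write offset into the arena
        self.index = {}  # key -> (offset, length)

    def put(self, key, value: bytes):
        n = len(value)
        if self.head + n > self.capacity:  # wrap: start eating the tail
            self.head = 0
        start = self.head
        self.buf[start:start + n] = value
        self.head = start + n
        # Drop index entries whose bytes we just overwrote.
        self.index = {k: (o, l) for k, (o, l) in self.index.items()
                      if o + l <= start or o >= self.head}
        self.index[key] = (start, n)

    def get(self, key):
        if key not in self.index:
            return None  # a real read-through cache would fetch from S3 here
        o, l = self.index[key]
        return bytes(self.buf[o:o + l])
```

The appeal of this design is that eviction is implicit and writes are strictly sequential, which is exactly what NVMe devices like.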

Simon [39:20]:
Right now in turbopuffer, it's very simple. Complexity, we think, really has to be deserved, and right now this is working quite well. It's actually both a read-through and a write-through cache: at write time we write into it, and when we read an object from object storage or lower in the hierarchy, we populate it up. The cache is prioritized, so namespaces that are really busy get their warming prioritized over others, and cache efficiency is a pretty heavy investment area for us.

Simon [40:00]:
This is also where you get control from not separating compute and storage; this is what I call compute-storage flexible, because there's no reason you can't tell the cache to always keep something resident if you want your p99 to stay very low over long periods of time. V1 reached its logical limits in the middle of 2024. The caching layer of the first version of turbopuffer was literally a caching NGINX in front of S3. When I brought Justine, my co-founder, along, this was obviously the first thing that we fixed, but hopefully it illustrates how simply we've iterated.

Simon [40:40]:
Every single part of turbopuffer is the simplest, most reliable thing that we know will work. I have operated NGINX for more than a decade, so I know it works and I know how to configure it, and every layer of complexity we add has to be deserved. Full-text search was bolted on, but it was very limited, as were the more general-purpose database features. I believe that any good, large database over time ends up implementing every query plan, and so it has to be designed as such; otherwise you'll be pigeonholed into a single workload.

Simon [41:20]:
When your customers have commercial traction on your database, they will expect more and more types of queries to execute on it as they trust it and put more and more data into it over time. V1 taken to its logical extreme is what found us product-market fit: it ran Cursor and Notion, and it got us the POC with Anthropic, which we won towards the end of 2024. While we were working on all of this, we had also, in parallel, hired Boyan.

Simon [42:00]:
And Boyan, who I met at IOI in 2012, actually knew how to write databases. So in parallel with primarily Justine, and also myself, optimizing V1, he was writing a proper V2 of the database, which we're now on. We now joke that there are still remnants of V1 in the code base, and Morgan calls it an important company objective to remove all of the founder code. Before we go to V2, I want to justify the existence of a new type of database. The title of this series of talks is, why not just use Postgres for everything?

Simon [42:40]:
There are two things that I keep coming back to. The first is that if you want to build a large, commercially viable database, you basically need the premise that every single company on earth, directly or indirectly, is going to have data in your database. If I think about even the coffee shop around the corner from my house, I can guarantee you that some of their data is in an Oracle database somewhere, or a Snowflake database, or Databricks. The biggest database companies in the world have a little bit from every single company on earth.

Simon [43:20]:
If you want to build a big database company, then that has to be part of your vision, and the best time to do it is when some underlying new workload or paradigm shift causes it to happen. For Snowflake and Databricks, suddenly we had these websites with lots of traffic and we wanted to do large-scale analytics on them, with fundamentally new economics. Oracle, Postgres, MySQL, and the others came before that because, hey, great, we have computers; let's put lots of data in them and make it possible to query.

Simon [44:00]:
These are not generations one, two, and three in the sense of one being better than another; all of these are different, phenomenal trade-offs, and turbopuffer is just another set of trade-offs that happens to work very well for the new workload right now, which is connecting very large amounts of unstructured data to LLMs. That's the workload causing traction for turbopuffer. The second thing you need is a new storage architecture that the incumbents can't easily replicate. If the new workload were enough on its own, you'd imagine that around 2013, as everyone started creating mobile apps full of geo-coordinates, a big company like MongoDB would have spun up and dominated the world on that basis.

Simon [44:50]:
But there's no reason geo-coordinates can't be supported in the Gen 1 and Gen 2 databases. The new storage architecture that we can create now is what I call compute-storage flexible, or object storage native, enabled by these developments; we can now create a database whose trade-offs are quite phenomenal for this new workload. These are the new attributes, and this is what would be very difficult to pull off inside of Postgres. Of course, in the limit you can make computers do anything inside of Postgres, but there has to be some pragmatic trade-off for when you do this.

Simon [45:30]:
So V2 of turbopuffer has a bunch of evolution on top of the ideas that were good in V1. The biggest change is that you can think of V1 as operating more or less directly on very simple files on object storage, whereas V2 introduces a key space maintained by an LSM, with all these different data structures implemented on top. The other big change was incremental indexing. The hilarious thing about V1 was that once the WAL had progressed enough and we had to rebuild the index, we rebuilt it from the entire WAL. The compute amplification of this was enormous, but it was very simple and very reliable to maintain.

Simon [46:20]:
V2 also added things like branching and our own load balancer; we'll go through these in a bit more detail. The LSM, by academic standards, is very disappointing; it's really an LSM 101 implementation. The systems side of the LSM has been heavily optimized for performance, but in terms of compaction, it's more or less the simplest tiered compaction you can imagine. We expect to spend a lot more time here as it starts to become a bottleneck for new workloads, but the LSM is very simple.

Simon [47:00]:
If anyone in this room is considering what to research at some point: the compaction territory we're in with these object storage native databases has quite a different set of trade-offs from a traditional database. In a traditional database, on a shard, you can't spend too much compute, too much I/O, or too much memory compacting the data, because you're contending for the same resources you use to service queries. In our case, you can spin up a dedicated index node for a very short time, run compaction, and build indexes on it as well.

Simon [47:40]:
This has a different set of trade-offs. Space amplification is very cheap because we're using S3, but you have to deal with a new kind of amplification in the number of S3 operations you're doing. Write amplification is cheap; read amplification is more expensive. Round trips matter, and throughput matters a little bit less. And as far as we know, there's no research on doing compaction in this kind of environment.

Simon [48:10]:
The other thing that we did in V2 was make the vector indexing incremental. Again, somewhat hilariously, in V1 it was not: we rebuilt the world once the WAL had progressed enough. SPFresh is a paper that describes how to maintain clusters incrementally. With clustering, you do a bunch of very simple repeated multiplications and distance computations to find the right clusters. And you have an intuition that you should be able to maintain them: you insert a new vector into a cluster, and the cluster expands.

Simon [48:50]:
But at some point the cluster becomes large and you get tail latency. So if a cluster grows very large, you split it, and if you're removing vectors from a cluster, you merge it with adjacent clusters. Now the problem is that this can cause nearby clusters to change as well, because when you split or merge, vectors that were previously in those clusters might now belong better to other clusters. This is a concurrency and implementation nightmare, as you can probably imagine. But it works unbelievably well at very large scale with a very large indexing throughput, and the performance characteristics on disk and object storage are really good.
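The insert-then-split maintenance Simon describes can be sketched loosely as follows; a toy Python illustration with hypothetical names (`insert`, `split`), which handles only the split side, skips merges, and skips the cross-cluster reassignment that SPFresh actually has to deal with:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vs):
    return [sum(v[d] for v in vs) / len(vs) for d in range(len(vs[0]))]

def insert(clusters, vector, max_size=4):
    # Append to the cluster with the nearest centroid; split it once it has
    # grown past the size that starts to hurt tail latency.
    cid = min(clusters, key=lambda c: dist(centroid(clusters[c]), vector))
    clusters[cid].append(vector)
    if len(clusters[cid]) > max_size:
        split(clusters, cid)

def split(clusters, cid):
    # 2-means-style split: seed with two far-apart members, then reassign
    # each vector to the nearer seed. (In SPFresh, reassignment can also
    # pull in vectors from neighboring clusters, which is where the
    # concurrency pain comes from.)
    vs = clusters.pop(cid)
    a = max(vs, key=lambda v: dist(v, vs[0]))
    b = max(vs, key=lambda v: dist(v, a))
    clusters[cid + ".l"] = [v for v in vs if dist(v, a) <= dist(v, b)]
    clusters[cid + ".r"] = [v for v in vs if dist(v, a) > dist(v, b)]
```

Even this toy version shows the key property: an insert touches one cluster, and a split touches one cluster plus its two children, rather than rebuilding the whole index.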

Simon [49:30]:
We've had this in production for almost a year and a half now, and we're working on a new version that will be even more powerful, integrating everything we've learned from running it. This is certainly one of the core parts of turbopuffer. The other thing that we do is hierarchical clustering. You can imagine that with a very large vector index, say a billion vectors, scanning through all of the centroids to find the closest clusters becomes a lot of compute, so we create clusters of clusters.

Simon [50:00]:
So essentially we're creating a tree. The other thing we do is quantize: we take these big float vectors and reduce them to binary vectors, which we can do very, very simple operations on. The algorithm we use to quantize, RaBitQ, gives us an error bound telling us which candidates we have to re-rank with the full-fidelity vector to hit some recall target. These are all approximate algorithms, so we have to have high recall, which is the percentage overlap with the exact nearest-neighbor result.
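The quantize-then-re-rank flow can be sketched as follows; a minimal Python illustration using naive sign-bit quantization rather than RaBitQ, with hypothetical names (`quantize`, `quantized_search`) and a fixed shortlist size instead of RaBitQ's error bound:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def quantize(v):
    # One bit per dimension: just the sign of each component. RaBitQ is far
    # more careful, but the shape of the pipeline is the same.
    return tuple(x >= 0 for x in v)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def quantized_search(vectors, query, k=1, shortlist=3):
    # Coarse pass over the cheap binary codes, then re-rank a shortlist with
    # the full-fidelity vectors. RaBitQ's error bound tells you how long the
    # shortlist must be for a given recall target; here it's just fixed.
    q = quantize(query)
    coarse = sorted(range(len(vectors)),
                    key=lambda i: hamming(quantize(vectors[i]), q))
    return sorted(coarse[:shortlist],
                  key=lambda i: dist(vectors[i], query))[:k]
```

The win is that the coarse pass is pure bit operations over tiny codes, so only the shortlist ever needs the full float vectors fetched and compared.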

Simon [50:50]:
So these are some of the things we've done to support use cases where people ingest all of Common Crawl or other very, very large datasets into turbopuffer. We wrote a blog post about how we did this, and this is how we can search around 100 billion vectors with a p99 of 200 milliseconds. That's probably been surpassed since, but we've never heard of a result at that scale. Filtering is a big challenge with vector indexes, and I think a lot of people somewhat irresponsibly just slap a vector index onto their existing query planner.

Simon [51:30]:
The query planner has to be very aware of the fact that the clusters matching a filter might be very far away, and that can create all kinds of problems in planning the query where you don't actually get the closest vectors. I won't go into a ton of detail on this here; we have a blog post with a bit more. But the query planner certainly has to be aware of the distribution of the filters relative to the vector data to plan a query with high recall.

Simon [52:00]:
We have three more indexes to go here. The full-text search index in turbopuffer is fairly sophisticated, and it draws a lot of inspiration from Lucene. We're starting to see comparable, and even better, performance than Lucene on very large datasets for very long queries, which are the ones we increasingly see in production as LLMs write the queries. Essentially, when you do full-text search indexing, you split the corpus by, say, spaces, put all of that into a map, and for each term keep the set of all documents that match it.

Simon [52:40]:
You also have to do all this work around scoring, and that's where, with every block of documents, we store the maximum contribution to the full-text search score of the documents inside that block. What that allows us to do is an enormous amount of skipping. You can imagine that a term like "puffer" is a lot rarer than a term like "the", and so its score in a block is a lot higher. We can use that further in the next slide, where I'll talk about the algorithm we use for skipping.

Simon [53:10]:
But we can use this information to rip through these postings and intersect them. Full-text search really just boils down to intersecting very, very large posting lists, so fundamentally it becomes a bandwidth problem. For these posting lists, the simple observation is that if you're searching for something like "New York population", there comes a point where it just doesn't make sense to look at any document that only has "new" in it anymore: basically, any viable document for the query needs to have "population" and "York", and then we can stop intersecting the "new" list.
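That pruning idea can be sketched in a few lines; a simplified Python illustration with a hypothetical `max_score_topk`, using whole-list maxima instead of the per-block maxima a real engine would use, and making no claim to match turbopuffer's implementation:

```python
def max_score_topk(postings, k):
    # postings: {term: {doc_id: score}}. The per-term upper bound is the
    # maximum score any document receives from that term.
    bounds = {t: max(p.values()) for t, p in postings.items()}
    terms = sorted(postings, key=lambda t: bounds[t])  # cheapest first

    def first_essential(threshold):
        # The longest prefix of cheap terms whose bounds sum to at most the
        # current k-th score is "non-essential": a doc found only in those
        # lists can never enter the top-k, so we stop driving candidates
        # from them and only use them to complete scores.
        total = 0.0
        for i, t in enumerate(terms):
            total += bounds[t]
            if total > threshold:
                return i
        return len(terms)

    top, threshold, seen = [], 0.0, set()
    for i, t in enumerate(terms):
        if i < first_essential(threshold):
            continue  # this entire posting list is skipped for candidates
        for d in postings[t]:
            if d in seen:
                continue
            seen.add(d)
            score = sum(p.get(d, 0.0) for p in postings.values())
            top = sorted(top + [(score, d)], reverse=True)[:k]
            if len(top) == k:
                threshold = top[-1][0]
    return top
```

With a common term like "the" carrying a tiny upper bound, its whole list drops out of candidate generation as soon as the top-k threshold rises above it, which is exactly the "stop intersecting the new list" behavior.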

Simon [53:50]:
This algorithm is called MaxScore, and it performs really, really well on modern CPUs; recently the trade-off has switched in favor of MaxScore over WAND. What this boils down to is that a lot of the complexity in search engines is around skipping documents that can no longer matter for ranking, and MaxScore is the one we've implemented in turbopuffer. It's also the one that's in Lucene, and I think these are the only two implementations of it. Regex queries are another type of text query. The way you service these is that you parse the regex into an AST, break it into trigrams, and then post-process all the documents that match the trigrams that are definitely going to be in any matching document.

Simon [54:40]:
This essentially boils down to a full-text search query, but the number of possible trigrams is finite, so we can just index them as numbers and save a bunch of work. So if you're searching for something like "main" or "util", we break it out into trigrams. There are also regexes that are very difficult to compile into trigrams, and that's when you have to run the final regex over the candidate documents; this is very common for code. And then the last index we have is the filtering index, for exact filtering. It has to work congruently with the vector and full-text search queries, and the planner has to know when it's better to filter versus score, because sometimes it's better to score and then filter, and other times it's better to filter first.
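The trigram filtering step can be sketched like this; a minimal Python illustration with hypothetical names (`build_index`, `regex_search`), which assumes a required literal substring has already been extracted from the regex's AST rather than doing that extraction itself:

```python
import re

def trigrams(s):
    # All 3-character substrings of s (empty for strings shorter than 3).
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(docs):
    # docs: {doc_id: text}. One posting set per trigram.
    index = {}
    for doc_id, text in docs.items():
        for t in trigrams(text):
            index.setdefault(t, set()).add(doc_id)
    return index

def regex_search(docs, index, pattern, literal):
    # `literal` stands in for a required literal substring pulled out of the
    # pattern: any matching document must contain all of its trigrams, so we
    # intersect those posting sets first.
    candidates = set(docs)
    for t in trigrams(literal):
        candidates &= index.get(t, set())
    # Post-process: run the real regex only on the surviving candidates.
    return sorted(d for d in candidates if re.search(pattern, docs[d]))
```

The intersection does the heavy lifting; the expensive regex engine only ever sees the handful of documents the trigram index couldn't rule out.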

Simon [55:30]:
In a lot of ways, it shares conceptual logic with the inverted index used for full-text search; underneath, it's a simpler bitmap implementation that figures out which documents match the filters and feeds them into the heap. I won't go into too much detail on this, but we've also written a queue on top of object storage to avoid having another dependency. This is not for the WAL or data ingestion; it's really just to notify the indexes when to compact and do various maintenance operations. It's essentially a single file called q.json that we do a lot of CAS (compare-and-swap) on, and it goes through a broker that does group commits on this file for the traffic from lots of query nodes. And this works great; it's a very simple design.
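The CAS loop on a single q.json-style object can be sketched as follows; a toy Python illustration where `ObjectStore` is a stand-in for an object store with conditional (If-Match style) puts, and the broker/group-commit layer is omitted:

```python
import json

class ObjectStore:
    # Toy stand-in for object storage with a compare-and-swap put.
    def __init__(self):
        self.blob, self.version = json.dumps([]), 0

    def get(self):
        return self.blob, self.version

    def cas_put(self, blob, expected_version):
        if expected_version != self.version:
            return False  # lost the race; caller must re-read and retry
        self.blob, self.version = blob, self.version + 1
        return True

def enqueue(store, item):
    # Read-modify-write loop on the single queue object: read the current
    # contents plus version, append, and retry on CAS failure.
    while True:
        blob, ver = store.get()
        queue = json.loads(blob)
        queue.append(item)
        if store.cas_put(json.dumps(queue), ver):
            return
```

With many writers, every enqueue contends on one object; batching many items per CAS through a broker (the group commit Simon mentions) is what keeps this cheap.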

Simon [56:10]:
Again, we only introduce complexity when it's deserved. The last thing coming to turbopuffer very soon, which is easier in an object storage native database, is branching: basically the ability, at a particular LSN or WAL entry, to branch off, start a completely new timeline, and keep references back to the previous branch. This is common for coding, for development, for testing, and lots of other use cases. It's coming to turbopuffer very, very soon.
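The branch-at-an-LSN idea can be sketched as follows; a toy Python illustration with a hypothetical `Namespace` class, assuming a branch is just a reference to its parent's timeline at a recorded LSN rather than a copy of any data:

```python
class Namespace:
    def __init__(self, parent=None, parent_lsn=0):
        self.wal = []                 # this branch's own entries
        self.parent = parent          # branch we forked from, if any
        self.parent_lsn = parent_lsn  # how much of the parent we see

    def write(self, entry):
        self.wal.append(entry)

    def branch(self):
        # Zero-copy branch: record a reference to the parent at its current
        # logical position instead of copying anything.
        return Namespace(parent=self, parent_lsn=len(self.entries()))

    def entries(self):
        # The visible timeline: the parent's entries up to the fork point,
        # followed by this branch's own writes.
        base = self.parent.entries()[:self.parent_lsn] if self.parent else []
        return base + self.wal
```

Writes to the parent after the fork are invisible to the branch, and vice versa, which is what makes this useful for development and testing against production-shaped data.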

Simon [56:50]:
The final slide: what's next for turbopuffer is to implement more and more query plans. There comes a time when you have to steer clear of the uncanny valley of continuing to use a JSON syntax and actually give people SQL. I don't know when we'll hit that threshold, but I think we'll know it when we see it. Faster, of course, and continued iteration on caching and lots of other features like branching, which I just mentioned. I'll leave it there for questions. Thank you so much.

Andy [54:37]:
All right, that was awesome. I will clap on behalf of everyone. So if you have a question for Simon, just unmute yourself and go for it. And I know there's a question in the chat that I'll read out later on afterwards.

Audience Member [54:51]:
Hi there. Thank you. Oh, I'm sorry. You can go ahead first.

Andy [54:56]:
Okay. Yeah, so I just had a quick question to clarify my understanding of the caching. You said it was a read-through and write-through cache. Let's say I have a cached search for the term Postgres, and then I write a document, like "auto-tune Postgres autovacuum", and then I search again. Is it going to see that document right away, or how does that work?

Simon [55:20]:
Sorry, what's the Postgres connection here?

Andy [55:22]:
No, I'm saying if you have a search that's already cached on the term Postgres and then you write a document with Postgres in it, are you going to get that right away, or how does that work?

Simon [55:33]:
Yeah, so it will go into the WAL, and the WAL is written into the NVMe cache; I don't think we write it to the memory cache too. So when you do the query, we'll see it immediately because it's in cache, and we do the round trip to the WAL to make sure that we see the latest entry. So it would be immediately visible and written into the cache, because otherwise, every time you write a document, we'd have to go to object storage to fetch it.

Andy [55:58]:
I thought, but I thought you said you didn't acknowledge the write until it made it to object storage, no?

Simon [56:02]:
Yeah, I mean, you're not going to see it until the write has returned to the client, like the client sees the 200.

Andy [56:09]:
Okay, got it. Thank you.

Simon [56:10]:
Sorry, is it clear?

Andy [56:11]:
Yeah, no, that makes sense. Thank you.

Audience Member [56:12]:
Hey, thank you so much. Appreciate this talk. I had a question, which is what, if anything, have you noticed while building turbopuffer in terms of the difference between storing information like common text files versus code? And where have you seen like the most interesting differences?

Simon [56:45]:
Trying to think of a simple answer here. I think code is extremely sharded; there are so many different shards across all of the companies we have that use code. And the write-to-read ratio is very high: there are a lot more writes than there are reads. In other systems, where it's simpler text, the queries often get more complicated. In code, the queries are usually quite simple, like find this regex, and there's generally very little filtering. For text, in more of an Atlassian or Notion kind of use case, there's a lot more filtering, often a lot of permissions. I did not know how much time we would spend optimizing query plans for permissions, which are very, very large intersects and are much better suited to posting lists than B-trees. So I would say those are some of the first things that come to mind.

Audience Member [57:52]:
So thank you for this talk. I had a question. If you were not building turbopuffer, what would you be building in like Gen 3? Kind of databases, you know, the object native databases? Or to reframe it, like if you have to build another database company, what are the other things you would build?

Simon [58:20]:
This is a phenomenal question. Okay, so I've spent a fair amount of time thinking about this. There are not that many companies to be built, unfortunately. There are a lot of databases to be built, but I don't think many of them will turn into big companies. One was the streaming use case: WarpStream and others have essentially done that. I think there were companies to be built there; they're not propelled by the new workload in quite the same way, but it is just the superior architecture for streaming. The second category is search; that's the one we're in. The third category is real-time analytics. There are a bunch of different vendors doing that; it's not a category I would go straight for, it's a very contentious one.

Simon [59:10]:
The other category is observability. Almost every observability provider builds their own database in some way and then slaps a big UI on it, but it's not a place where you actually sell the database; you have to sell a full end-to-end experience. If I had to build in another vertical, it would probably be there, trying to do it not at the 85% gross margin of the incumbents but more like 50 or 60%, and expand from there. But that would be very capital intensive, and it requires an enormous amount of UI work to pull off successfully. The other interesting category, beyond the ones I've just mentioned for this database design, is that lots of companies can now navigate a very specific set of trade-offs and build their own database much more easily than they could in the past.

Simon [1:00:10]:
This is probably where the majority of these databases are going to be built: inside companies with a particular set of trade-offs. But that is not a generally viable, large database company. That is my somewhat pessimistic answer to your question.

Audience Member [1:00:12]:
Thank you so much.

Andy [1:00:17]:
Okay, go for it.

Audience Member [1:00:19]:
Oh, thank you.

Audience Member [1:00:21]:
So I'm curious about the consistency model within a single namespace. My understanding is that the inverted file indexes are updated asynchronously after a write-ahead-log commit. So with concurrent writes and reads, is there a risk of getting a mixed state, where some documents are indexed wrongly and others are updated in time? Is this a trade-off that's acceptable for the targeted workloads? And do you see cases where this could be a real problem?

Simon [1:00:58]:
Okay, so just to quickly reiterate how the search works. For full-text search, and for vector search it's actually easier: let's say that the WAL is, let's use a nice number, 64 MB ahead of the index. Then we'll go to the index, consult the clusters, do that search, and then replay the WAL on top; actually, we do those two operations in parallel.

Simon [1:01:30]:
And then you have a consistent result. I think what you may be talking about is a situation where you're updating a vector. On my second slide, I showed how you can do simple conditions and things like that, for example a patch by filter. That requires running a query and then doing the update, and in between those events it is possible for something else to occur, so you'd have to rewrite the operation. turbopuffer is at the read committed isolation level, so that is the kind of thing that can occur with multiple writers. That's fairly common in databases and other systems, and it felt like the right trade-off for turbopuffer.
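The index-plus-WAL-replay read Simon describes can be sketched as follows; a minimal Python illustration with a hypothetical `fresh_query`, assuming WAL entries are simple upsert/delete tuples ordered after the index's snapshot:

```python
def fresh_query(index_docs, wal_tail, predicate):
    # Serve from the (possibly stale) index snapshot, then replay the WAL
    # entries the index hasn't absorbed yet, so every acknowledged write is
    # visible to the query.
    docs = dict(index_docs)
    for op, doc_id, value in wal_tail:  # entries newer than the index
        if op == "upsert":
            docs[doc_id] = value
        else:  # "delete"
            docs.pop(doc_id, None)
    return {d: v for d, v in docs.items() if predicate(v)}
```

In the real system the index search and the WAL fetch run in parallel, and the replayed tail is bounded (the 64 MB in Simon's example) so the merge stays cheap.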

Audience Member [1:02:07]:
Thank you. I also have a different question, more related to business. The underlying storage infrastructure, S3 and GCS, is owned by the hyperscalers, and they could theoretically build something similar by investing enough time and resources. So what do you think your moat is?

Simon [1:02:28]:
Okay, so two things. Working with the hyperscalers is just something every database company contends with; it's the lay of the land. Inside these large companies, the team that works directly with you is incentivized to have you grow as much as possible, and another team is incentivized to grow their own database as much as possible. That's how these organizations work; the incentives are aligned there, and I think that's well and good with all of the vendors. But of course, if they see a lot of commercial traction, they have the signal to try to build something similar.

Simon [1:03:12]:
And at least one of the clouds has done something that looks a lot like turbopuffer.

Audience Member [1:03:14]:
FPGAs, right?

Simon [1:03:14]:
Correct. But it doesn't have the cache hierarchy and things like that in front. So if you're building a database company and competing with the hyperscalers, you have to think about what you can do that they never could. One thing the hyperscalers could never do is put their database engineers into a Slack channel with the companies that actually matter and have them work directly on the query plans.

Simon [1:03:40]:
Another thing that's very difficult to do at that scale is to manage all the caching and the multi-tenancy, and to release features at a fast cadence. I think a startup's only moat is focus; that's it. If you can focus on something for a long period of time, I think you will do really well against the hyperscalers. Think about, for example, the Postgres offerings of the hyperscalers: most of them are not even on NVMe SSDs, even though those have been available on cloud SKUs for almost a decade. So there are lots of opportunities, I think, for database companies to simply run these databases better than the clouds do themselves.

Audience Member [1:04:15]:
I'll just point out that you built V1 in a cabin in Canada and launched it, without going to any design meetings with a team, right? So you can't do that at a larger company.

Simon [1:04:27]:
Yeah. No, no, you could not. Not like that.

Audience Member [1:04:31]:
Yeah, of course.

Simon [1:04:32]:
V1 uniquely works because you can be very close to the customer. If anything comes up in a POC, you can give them the experience of fixing it immediately. Maybe another way to phrase it is that when Notion became a turbopuffer customer, they were exercising every single line of code in the database. That's an advantage you have; you can launch a lot quicker.

Andy [1:04:56]:
All right. So we're almost out of time, and I appreciate you going over. First question: it sounds like your file formats on S3 are all proprietary, custom formats that you developed. At any point in the design or build-out of either V1 or V2, did you consider using one of the open-source file formats like Parquet, ORC, or the newer ones (e.g. Nimble)?

Simon [1:05:19]:
The main reason we didn't go that route was speed. We just want control. There's no philosophical debate: if everything could live in the customer's bucket in an open file format and we could move at the same pace, I would say that would be a superior design. But again, a startup's only moat, the only thing it has on everyone else, is speed. So we've always optimized everything for what we can ship fastest and most reliably. Same reason we're not open source; we're commercial-only not because we don't think open source is great, but because we don't have time.

Andy [1:05:58]:
Yeah, sure. And then the last one: I think you mentioned you had a backup slide with your object storage bag of tricks. I know we're over time, but could you share that?

Simon [1:06:08]:
Yeah, we will turn this into a full-fledged blog post at some point, but there are a lot of tricks you have to employ when you're building an object-storage-native database like turbopuffer. To take a couple of random ones: speculation is a big one. You always want to speculate about what the next round trip is going to be and issue it in parallel with whatever you're already doing. We use this trick everywhere; if you look at turbopuffer's traces, we're constantly speculating on what we might need next.
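The speculation trick can be sketched roughly like this: while consuming the current round trip, kick off the fetch you guess will be needed next, so its latency overlaps with work you had to do anyway. `fetch_segment` is a toy stand-in for an object-storage GET, not a real turbopuffer call.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_segment(name):
    """Simulate one object-storage round trip."""
    time.sleep(0.05)
    return f"data:{name}"

def query_with_speculation(segments):
    results = []
    with ThreadPoolExecutor() as pool:
        future = pool.submit(fetch_segment, segments[0])
        for nxt in segments[1:]:
            # Speculatively start the next GET before we need it...
            speculative = pool.submit(fetch_segment, nxt)
            # ...so its latency overlaps with consuming the current one.
            results.append(future.result())
            future = speculative
        results.append(future.result())
    return results

out = query_with_speculation(["seg-0", "seg-1", "seg-2"])
```

In the sketch the "guess" is trivially correct (the next segment in order); in a real engine a wrong guess just wastes one cheap request, which is the bet speculation makes.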

Simon [1:06:50]:
The metadata layer on S3, GCS, and all the other providers is much faster than the storage layer. Fetching a 4 kB file might have a p99 of, say, 50 milliseconds, whereas just checking the metadata, "is the copy I have locally still current?", might have a p99 closer to 30 milliseconds, depending on the provider. The p50 is generally a lot lower for the metadata, but when we design, we always design for the p99. Another random one, we talked about LIST earlier: we keep a read-through cache of a bunch of starting points in the bucket, places to start listing from, so we can parallelize and minimize the LIST calls.
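The metadata-vs-storage point can be illustrated with a small sketch: before re-reading an object, ask only for its ETag (a HEAD-style metadata call) and skip the slower data GET when the locally cached copy is still current. The `Store` class is a toy stand-in for S3/GCS, with the call counter there only to show which layer gets hit.

```python
class Store:
    def __init__(self):
        self._objects = {}   # key -> (etag, data)
        self.get_calls = 0   # counts expensive storage-layer reads

    def put(self, key, data, etag):
        self._objects[key] = (etag, data)

    def head(self, key):
        return self._objects[key][0]     # metadata layer: cheap, low p99

    def get(self, key):
        self.get_calls += 1
        return self._objects[key]        # storage layer: expensive

class CachingReader:
    def __init__(self, store):
        self.store = store
        self.cache = {}                  # key -> (etag, data)

    def read(self, key):
        cached = self.cache.get(key)
        if cached and self.store.head(key) == cached[0]:
            return cached[1]             # pay only the metadata p99
        etag, data = self.store.get(key) # pay the storage p99
        self.cache[key] = (etag, data)
        return data

store = Store()
store.put("wal/000001", b"entry", etag="v1")
reader = CachingReader(store)
first = reader.read("wal/000001")    # full GET, fills the cache
second = reader.read("wal/000001")   # ETag check only, served from cache
```

A real implementation would use the provider's conditional-request support (e.g. an If-None-Match GET) to collapse the check and the fetch into one round trip.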

Simon [1:07:48]:
We run thousands of concurrent LIST calls, each knowing where to paginate from, until their ranges meet. Also, 404s are free on GCS; you can do crazy things with that if you really think about it. They're not free on any other provider, but if you're only on GCS, it's very interesting.
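The parallel-LIST idea can be sketched like this: keep a cache of starting keys spread across the bucket, issue one paginated LIST from each start point in parallel, and stop each worker once it passes the next cached start point so the ranges tile the keyspace. `list_page` is a toy stand-in for a real S3/GCS list call with an exclusive start-after cursor; the key names and page size are made up.

```python
from concurrent.futures import ThreadPoolExecutor

KEYS = sorted(f"seg-{i:04d}" for i in range(100))
PAGE = 10

def list_page(start_after):
    """Return up to PAGE keys strictly greater than start_after."""
    after = [k for k in KEYS if k > start_after]
    return after[:PAGE]

def list_range(start, stop):
    """One worker: paginate from `start` until keys pass `stop`."""
    out, cursor = [], start
    while True:
        page = list_page(cursor)
        if not page:
            break
        for k in page:
            if stop is not None and k > stop:
                return out
            out.append(k)
        cursor = page[-1]
    return out

def parallel_list(start_points):
    # Each worker includes its stop key, and the next worker's
    # exclusive start-after skips it, so ranges never overlap.
    ranges = list(zip(start_points, start_points[1:] + [None]))
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        chunks = pool.map(lambda r: list_range(*r), ranges)
    return [k for chunk in chunks for k in chunk]

keys = parallel_list(["", "seg-0025", "seg-0050", "seg-0075"])
```

With four cached start points, the full listing completes in roughly a quarter of the sequential pagination depth; the trade-off is that stale start points just make the ranges uneven, never incorrect.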