Kaivalya Abte [0:00]:
Hello everyone, welcome to the Geek Narrator podcast. I am your host Kaivalya
Abte. In this episode, we will talk about how to approach building a database on
top of object storage, what the key challenges are, and how we can learn from
Simon's experience building turbopuffer. So welcome, Simon. I am really excited
to have you on, and let's start with a little bit of introduction about
yourself, your background, and how you ended up building turbopuffer.
Simon Eskildsen [0:27]:
Thank you so much for having me on, Kaivalya. I'm really excited to be here.
Yeah, we can start with a little bit of background. I grew up a couple of hours
north of you in Denmark, and I left to work at Shopify after high school,
thinking I would do a little gap year in Canada working for what was at the time
a small successful Rails shop. And then I got super interested in the
infrastructure that powered it, specifically the databases. When you're working
on a SaaS app of that scale, typically the thing that breaks the most is
whatever has the most state, and what has the most state is the database
generally. So I spent almost a decade working there from 2013 until 2021 on
mainly the data layer of Shopify to make sure that the product ambition of
Shopify was not limited by the infrastructure, and even more so the customer
ambition because we had some of the big West Coast celebrities that would sell
an enormous amount of product—Kylie Jenner and Kanye, and they were all somehow
in the Kardashian family or affiliated. I joined when we were doing a little
less than a thousand RPS, and when I left, it was doing more than a million at
scale. As part of this, you work on almost every single part of what powers the
data layer underneath Rails, but just above the database. Of course, you need
extensive knowledge of how the database works. When you're scaling that fast,
you don't have time to write the perfect database for your workload; you have to
do the best with what you have, and that's what I think we did. I worked on
moving Shopify into multiple data centers, sharding MySQL, sharding everything,
running out of multiple regions, failing over inside of a Slack bot, writing the
storefront to be better for caching, and then multi-tenancy and protection was a
big part of it. I left in 2021 and spent some time with my friends' companies
helping them with various scalability challenges, kind of hopping around and
helping them in quarterly increments. These days, the main database challenge
seems to be tuning autovacuum in Postgres. This was the common denominator
across almost everyone that I worked with. Of course, working with MySQL for a
decade, this is not really the issue. There are other issues, but that is not
one because the way that MVCC works is different. One of the companies that I
worked with builds an application called Readwise. What it allows you to do is
take the articles that you read online and put them into a queue so you can read
them and things like that. One of the things we wanted to do was article
recommendations. They asked me, "Can you spike a little project to do article
recommendations?" I said, "Yeah, sure." We did some experiments using
embeddings. So embeddings, right: you take some data, whether unstructured,
semi-structured, or completely structured, and you turn it into a vector. For
anyone in your audience who isn't familiar with what this is, here's a quick
explanation. You might use something like Spotify; this
is literally how Spotify works. You take a song and feed it through the model;
the model will spit out a coordinate in a coordinate system. You can imagine it
in two dimensions, right, like X and Y. All the rock songs are over here, the
country songs over here, and in between you have the artists that span both. In
order to do recommendations, for example, it might be your catalog. You can
imagine that you will color a bunch of these coordinates that you've listened
to, and then we can find recommendations by finding adjacent songs to those and
doing a little bit of geometry to figure this out. That was what we were going
to do for these articles. We were going to take the articles and chunk them up
into pieces and turn that into vectors, and then we could look and say, "Well,
these are chunks that you've read in the past." Maybe you've read about how to
keep a baby alive if you're a new parent. Maybe you're reading about object
storage, whatever. Those land in very different parts of the vector space. Then we could
find similar chunks. We did some experiments with this, and it was promising. It
would take some more tuning. But at this point, I sort of ran the
back-of-the-envelope math on taking all this data that was inside the Postgres
database, chunking it up, putting it into a vector database, and then doing the
recommendations. This company was spending around $3,000 a month on their
Postgres database. The back-of-the-envelope figure for doing this with a vector
database, the ones that were available at the time, was $30,000 a month. You
can't eat $30,000 a month, even if this was worth it from a product perspective.
The fact that it's 10x what you're doing with your relational data—the database
where if it goes down, your entire company is down—you can't do that. This is
just like an unfeasible thing. So we kind of put this in the bucket of, "This is
going to get cheaper," right? Like everything else, the context window is going
to grow, the tokens are going to get cheaper; surely someone's going to figure
out how to store the coordinate system cheaper, right? But I couldn't stop
thinking about it. The outages from running Elasticsearch at Shopify, which is
operationally, I think, one of the most challenging databases that I've ever
operated, were flashing before me. To me, it's always been very obvious
that for a search workload, you have to separate compute and storage, and no one
had really done it before. If you kind of imagine a spectrum of OLTP to OLAP,
you can imagine that for OLTP, you need very low latency, low latency writes.
Everything happens like sub-millisecond if it's a point lookup, but at least
single-digit milliseconds in the worst case, right? If you're writing a big
Rails application, you're depending on all of these point lookups being hundreds
of microseconds because you're doing them sequentially. Now, on the other hand,
you have the OLAP solutions, right, that have separated compute and storage now
for a long time where query latency in the seconds is completely fine, right?
You're doing a lot of writes, and you're doing a query once in a blue moon to
generate a report. Search is somewhere in between. You're actually okay with
maybe 30 to 50 milliseconds of latency; you're processing more data, but your
write-to-read ratio is enormous. It's not uncommon that you're writing like 100
times more than people are searching for stuff, right? The ideal architecture to
me just looked a little bit like a hybrid between the two. You need really good
cache locality, but you also want effectively unbounded compute. The other
thing you notice, and anyone working on search will talk about this problem, is
that to do good search, you've got to do a lot of experiments. For some
experiments you can tune some parameters and
see that your search results are improving, but for other experiments, you have
to rewrite all the data to add some more data. Maybe you want to weigh in the
popularity of the content, how much it's been clicked, how recent it is, and you
didn't do that the first time around, so you're rewriting this data a lot.
Often, you're provisioning for your steady-state write load. Now, if you have to
readjust everything, it's often very computationally expensive. To me, this
object storage-first architecture just seemed really like the right fit. Back to
Readwise, I just had a moment of like, "Okay, you know what? Screw it. I'm just
going to sit down and try to write this thing." That's what I did last summer,
and then we released it in September or October of last year.
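The coordinate-system picture Simon sketches can be made concrete with a toy example in Python. All the song names and 2D coordinates below are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

# Toy 2D "embeddings": each song is a coordinate, with similar genres
# clustered near each other (all values here are made up).
songs = {
    "rock_song_a":    (0.9, 0.1),
    "rock_song_b":    (0.8, 0.2),
    "country_song_a": (0.1, 0.9),
    "crossover_song": (0.5, 0.5),
}

def distance(a, b):
    """Euclidean distance between two coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(listened_to, catalog, k=2):
    """Return the k nearest songs the user hasn't listened to."""
    candidates = [s for s in catalog if s not in listened_to]
    # Score each candidate by its distance to the closest listened song.
    scored = sorted(
        candidates,
        key=lambda s: min(distance(catalog[s], catalog[h]) for h in listened_to),
    )
    return scored[:k]

print(recommend({"rock_song_a"}, songs))  # → ['rock_song_b', 'crossover_song']
```

The "little bit of geometry" is just a nearest-neighbor search; a real vector database replaces the brute-force sort with an approximate index.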
Kaivalya Abte [8:02]:
Amazing. That's a very interesting story, right? You talked about two very
important points: how the monthly expenditure on Postgres went from $3,000 to
$30,000. That's quite a lot. The second one was about how to keep a baby alive.
That's so important, right? I am a new parent; it's really a tough problem. But
anyways, we are here to talk about databases. Thanks for sharing the background
of how you ended up building turbopuffer; it's important for people to connect
more. I want to understand first, is the problem only the amount of data that
you're storing, or along with it, do the computation requirements also grow?
Specifically for the example of going from $3,000 to $30,000, what was the exact
reason for this increase? Is it just the storage requirements because of the
number of vectors you're storing, or is it also computation attached to it?
Simon Eskildsen [8:59]:
Yeah. First, I just want to clarify something. The $30,000 a month was to do the
vector workload of a subset of the data in the Postgres database that cost
$3,000 a month.
Kaivalya Abte [9:09]:
Okay.
Simon Eskildsen [9:09]:
It was not the cost of adding vector indexes into Postgres. That may or may not
be lower; unclear. What you're asking about specifically, though, was around why
this architecture is good for certain workloads.
Kaivalya Abte [9:26]:
Yeah.
Simon Eskildsen [9:37]:
So a write to object storage is generally around 100 milliseconds, a little bit
higher if you're doing multi-part for uploading a bunch of data. You're going to
have to do multi-part if you want to saturate the network, but ballpark in the
hundreds of milliseconds. If you are a transactional database and you're doing
lots of writes and doing transactions, this is too high, right? It doesn't
really work. But for a search workload, it's perfectly acceptable, especially
given that the throughput—there's really no constraint. You can do much higher
throughput than a traditional OLTP database, right? The other thing is that
these vector indexes are enormous, and that's what's really forcing this change.
If you think about it in terms of data, right, you might have an article that's
about a kilobyte of text, and that kilobyte of text, if you transform that into
a vector, that vector might be in the ballpark of six kilobytes. So your vector
data is already larger than the original data itself. And it gets worse than
that, because what often ends up happening is that you say, "Well, I don't want
to do a vector embedding of the entire article; there's too many things going on
here." To generate a good representation in vector space, you chunk it, right?
So you chunk by paragraph or something like that. Let's say there's a
one-kilobyte article; now it turns into four vectors, each one of those vectors
is six kilobytes. Well, now you have like 24 kilobytes of data, and your space
amplification on the original data is 24. So this is really the forcing
function: there's just an enormous amount of data; the space amplification is
phenomenally high.
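The space amplification Simon describes is easy to verify with napkin math; the chunk count and vector size below are the ballpark figures from the discussion.

```python
# Numbers from the discussion: a ~1 KB article, chunked into 4 pieces,
# each embedded as a ~6 KB vector (e.g. 1536 float32 dimensions ≈ 6 KB).
article_bytes = 1_000
chunks = 4
vector_bytes = 6_000

total_vector_bytes = chunks * vector_bytes          # 24,000 bytes of vectors
amplification = total_vector_bytes / article_bytes  # 24x the original text
print(f"{total_vector_bytes // 1000} KB of vectors, {amplification:.0f}x amplification")
```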
Kaivalya Abte [11:23]:
Yeah.
Simon Eskildsen [11:24]:
That's what warranted me to start looking at a different architecture for this
altogether. On a traditional architecture, either an OLTP or something like
Elasticsearch that works somewhat similarly, when you do a write, that write
goes into a primary and then is often replicated to two secondaries. That means
you're storing every piece of data on disk three times. Generally, you store this on
an SSD or a very high-performance network disk like EBS or whatever. The reason
you do that is because at any point you could fail over to those other
replicas, right? That's part of the point of having them, so they need to have
the same thing. If you want to buy a gigabyte of EBS, like networked storage,
which is the most reliable storage that you buy in the cloud among block storage
options, you have to pay about 12 cents per gigabyte. So if you have three
gigabytes of data because you have to store it on three nodes—the primary and
both secondaries—now you're paying like 36 cents per gigabyte, right?
Kaivalya Abte [12:29]:
Yeah.
Simon Eskildsen [12:30]:
Because it's 12 cents each. And that assumes you're running at 100% utilization,
right? That's every bit filled; there's no excess space. Typically, you're
going to run this at 50% utilization to not get nervous, right? Maybe you do
80% or something like that, but that also requires a lot of ops if you want to
resize that often. A lot of production deployments run with less than that. At
50% utilization, you have to double that number, so now we're talking more like
70 cents per gigabyte. If you're
doing a managed offering, you kind of want to see at least a 50%, preferably an
80% profit margin on that kind of storage. So now, like, the actual cost of
goods sold, the base cost is almost 70 cents per gigabyte. But the cloud
provider has to try and charge you like double that in order for them to offset
all their other spends, right? Often three times. Now with object storage, the
cost just looks completely different, right? The cost of goods sold is that
every gigabyte I put on object storage is already replicated, and it already has
more nines of durability than any other system on earth. It's already
replicated; you benefit from their multi-tenancy, and you're paying two cents
per gigabyte. Then you cache it on an NVMe disk. That's not a network device;
it's an actual SSD attached physically to the box you're on, so you're not
going over the network the way you are with a network disk. It's much faster,
much cheaper, has amazing performance, and now you're talking about 10 cents
per gigabyte. Then you can charge a margin on that and still be cheaper than
hosting Elasticsearch or something like that yourself with a more traditional
deployment model. The economics work out, because the biggest trade-off here,
really the only fundamental trade-off, is write latency, especially if you
wanted to do transactions. For search, this is a really good compromise. This is
not a good compromise if you're writing like a checkout engine or inventory
management or any of the problems that, for example, TigerBeetle is solving.
It's a terrible set of compromises, but for something like search, it's a
phenomenal set of trade-offs.
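Simon's storage cost comparison can be laid out as napkin math. The prices are the ballpark figures quoted in the conversation, not current list prices, and the NVMe cache share is an assumed split of the roughly 10 cents he mentions.

```python
# Classic replicated deployment: networked block storage (EBS-style).
ebs_per_gb = 0.12     # $/GB/month for the networked disk
replicas = 3          # primary + two secondaries
utilization = 0.5     # run half full so you don't get nervous

classic = ebs_per_gb * replicas / utilization  # ≈ $0.72/GB/month before margin

# Object-storage-first: S3 is already replicated; add a local NVMe cache.
s3_per_gb = 0.02          # object storage, replication included
nvme_cache_per_gb = 0.08  # assumed share of the ~10 cents total

object_first = s3_per_gb + nvme_cache_per_gb   # ≈ $0.10/GB/month

print(f"classic: ${classic:.2f}/GB, object-first: ${object_first:.2f}/GB")
```

That 7x gap in cost of goods sold is what lets an object-storage-first system undercut a self-hosted replicated deployment even after margin.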
Kaivalya Abte [14:50]:
Okay, so we talked about trade-offs, right? I want to dive deeper into the topic
at hand, which is how do you think about building a database on top of an object
store? Since you're talking about trade-offs, it's clear it's more challenging
for you to build a high-scale transactional database on top of object storage
than it is to build a vector database or a search database, right? Because of
the expectations or the different trade-offs in terms of write latency and read
latency that you have. So I think the first thing we should talk about is, let's
take a step back, right? You mentioned the motivation of thinking about object
storage, which is probably cost efficiency, which could also mean less
operational overhead because a lot of availability and durability guarantees are
provided to you by the object storage. So how do I approach this problem? When I
think about, "Okay, let's say I want to build a database on top of object
storage," the first thing I have to understand is the trade-off, and that's
clear from how you defined it. What are the next steps? How should I think about
it? If I look at a high level, like an architecture, it is clear that there's
object storage, which is my main storage component, but then I have to also have
some caching layer and then the clients. How do you think about that? Let's
start from there, and we can unwrap each layer at a time and talk more.
Simon Eskildsen [16:14]:
Why don't we just talk it through then? I think this is what you're getting
at—designing the simplest possible database on top of object storage, and then
we can talk about how to optimize every part of it.
Kaivalya Abte [16:25]:
Exactly.
Simon Eskildsen [16:26]:
Turbopuffer aside, though the experience of building it is, of course, very much
at hand. So the first thing you want to do, right, is that you have to accept the
write. There are no reads without doing a write. So that's the first thing you
do. In the simplest possible way, using object storage, what you would do is
create a directory or a prefix called write-ahead log, or WAL for short. Every
time someone does a write, you put a file inside the WAL directory, and you just
increment the number. So the first one is called 1.json, 2.json, 3.json, 4.json,
right? The write-ahead log is where the data gets appended; any new data lives
in there. A delete is just registered as another entry: each record has a type,
and the type might be insert or delete, and that's really all you need. Then
you have the actual record with the data,
which would just be, let's say this is a simple JSON MongoDB type of scenario.
So this is your database, right? Like, really, this is a database in itself. You
can start to satisfy queries on top of this, right? So what we do now is that we
have a node, and there's a binary on that node. This could be written in
whatever language that you choose. Now I do a select all from table. What that
would do is that it would just go into the write-ahead log and stream through
it, respecting the deletes and the inserts. Okay, now we have a database on
object storage. That's all you need. Like, you know, this can do any query. It's
just going to be a very slow full table scan, right? That's a real database. I
think it's going to be very hard to write something that's much more reliable
than that because none of those files are ever going to get lost. You have like
full time travel. Someone could send you a query to say, "Hey, I want to do this
query as per the fourth entry," and then you just do a full table scan up to
that. This is a great database; this database can do a lot of things. It scales
quite terribly, but from here on, like, that's a real database you could ship
that, right?
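The design Simon walks through can be sketched in a few dozen lines. Here a local temp directory stands in for the object-storage bucket, and the record shape (type, id, data) follows his description; a real system would use PUT/GET calls instead of file I/O.

```python
import json
import os
import tempfile

# SimpleDB sketch: a "bucket" (a local directory standing in for object
# storage) with a wal/ prefix of numbered, immutable JSON entries.
bucket = tempfile.mkdtemp()
wal_dir = os.path.join(bucket, "wal")
os.makedirs(wal_dir)

def next_seq():
    """Next WAL sequence number: entries are named 1.json, 2.json, ..."""
    return len(os.listdir(wal_dir)) + 1

def write(record):
    """Append one WAL entry; record['type'] is 'insert' or 'delete'."""
    path = os.path.join(wal_dir, f"{next_seq()}.json")
    with open(path, "w") as f:
        json.dump(record, f)

def scan():
    """Full table scan: replay the WAL, respecting inserts and deletes."""
    state = {}
    for i in range(1, next_seq()):
        with open(os.path.join(wal_dir, f"{i}.json")) as f:
            entry = json.load(f)
        if entry["type"] == "insert":
            state[entry["id"]] = entry["data"]
        elif entry["type"] == "delete":
            state.pop(entry["id"], None)
    return state

write({"type": "insert", "id": 1, "data": {"name": "a"}})
write({"type": "insert", "id": 2, "data": {"name": "b"}})
write({"type": "delete", "id": 1})
print(scan())  # → {2: {'name': 'b'}}
```

Time travel falls out for free: replay `range(1, n)` for any `n` and you get the database as of that WAL entry.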
Kaivalya Abte [18:41]:
Absolutely. But how do you query the data in an efficient manner? So, okay, but
I want to take a step back, right? When we talk about traditional databases, we
say that, let's talk about PostgreSQL, for example, you have your data pages,
and then you also have your WAL, and you write to a WAL to ensure durability,
and then you write to the buffer cache, and then you periodically flush that and
make the entries in the real data pages. I think the reasoning, if I remember
correctly, is that writing to a WAL is much faster than updating the data pages.
But if I think about object storage, think about a WAL versus maintaining it in
a structure which is efficient enough already for reading data. How do you
compare these two approaches? Why do we need a WAL when we are using object
storage?
Simon Eskildsen [19:47]:
Yeah. I mean, I think the best way to answer your question might just be to
continue to evolve on this database, right? We can think of SQL queries that we
want to execute on this database, and then let's see the design evolve, right?
Kaivalya Abte [20:00]:
Yeah.
Simon Eskildsen [20:01]:
Then I think let's backtrack and then do something a bit more akin to what
you're saying, which I think would be a way of approaching it. But it's not as
simple. I like to just build up in complexity and defer it until we need it.
That's how you ship fast, right? So let's say now that one of the things that
you want to start doing is that you want to do an efficient lookup, right? So
you want to do a select all from the table where ID equals something, right?
Kaivalya Abte [20:33]:
Yep.
Simon Eskildsen [20:34]:
The way that we're doing this now is that it's pretty slow, right? Because now
you have thousands of entries in the write-ahead log, and you have to stream
through all of them. We haven't implemented any optimization. So what ends up
happening is that on every query, we go and get thousands of objects, right,
from object storage. This is slow because let's say that this is like gigabytes
of data or megabytes of data. You can saturate the network, sure. Depending on
the box size, we're talking into tens of gigabits per second, so that's going to
be a concern. But we know that the write-ahead log is immutable; 1.json is
never going to change, right? So what we could do is
we could start putting those in a cache. We could put them in a memory cache.
That's the simplest. So now every time we go to S3 or something S3 compatible to
get the WAL, we first look and see in the memory cache whether 1.json exists or
whatever, right? Then we scroll through them in memory. That's probably going to
be pretty fast. I think then, you know, you could do more cache tiering like
what turbopuffer has of also having an NVMe SSD cache and things like that. You
could actually build a database that could do a lot of things just on top of the
write-ahead log and streaming through it. Now, what you're getting at before is
to start building some kind of derived indexes on top, right? Even potentially
one of these indexes could be the primary data storage, right? So that you don't
even write to the write-ahead log; you just update that primary index where the
data is stored as the primary one. I think we can get into that in a second of
why that might be a good idea and why it might be a bad idea, but right?
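Because WAL entries are immutable, caching them needs no invalidation, which is the key property here. Below is a sketch with a plain dict as the in-memory tier; `fetch_from_object_storage` is a hypothetical stand-in for a real GET call, and a real system would add an NVMe tier and eviction.

```python
# Immutable WAL entries can be cached forever: 1.json never changes,
# so a cache hit is always correct.
cache = {}
gets_to_object_storage = 0  # count how often we actually hit the store

def fetch_from_object_storage(key):
    """Stand-in for s3.get_object(Bucket=..., Key=key)."""
    global gets_to_object_storage
    gets_to_object_storage += 1
    return {"type": "insert", "id": int(key.split(".")[0]), "data": {}}

def get_wal_entry(key):
    if key not in cache:              # miss: one GET, then cached forever
        cache[key] = fetch_from_object_storage(key)
    return cache[key]                 # hit: served from memory

get_wal_entry("1.json")
get_wal_entry("1.json")               # second call never leaves the process
print(gets_to_object_storage)         # → 1
```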
Kaivalya Abte [22:21]:
Yeah.
Simon Eskildsen [22:22]:
Now we have this point lookup; it's actually pretty good now. It checks the
in-memory cache first; if it misses that, it goes to the NVMe SSD cache. It's
still scrolling through everything until it finds the ID, but it has to go all
the way to the end because you could have upserted it multiple times. So that
the stream through everything, we could stream in reverse so that could be an
optimization we can do, and then we might find the ID earlier. But we might do
things like this, but of course, if you want to make this fast, you've got to
start creating some derived indexes off of the WAL data. We might ask, "Okay,
are we only going to do point lookups, or do we want range scans and things
like that?" If you're only doing point lookups, you can do a hash map. If you
want to have range scans and some ordering and things like that, you might do a
B-tree. So we could create that, right? So what we do is that we could just
build it in memory. So we have the entire write-ahead log; every time you ask a
question, we have an in-memory B-tree map, and we just consult that for any
updates, right? Every time a query comes in, we just look in there and see if
it's there. If it's in there, we just serve it out of there. Now you might start
to think about consistency, right? Because now another process might be writing
and have added, you know, 3.json, and the B-tree map was built at 2.json. Do you
have to now—you have a decision point. Do I want eventual consistency? Do I want
strong consistency where you can read your write immediately, right? Like I
insert and then I see it immediately. So you have a decision point. You could
decide, "No, I want to be consistent; that's how I want to do it." So what
you're going to do is that every time a query comes
in, you do a list call to S3 in the WAL directory before you do everything else.
Then, of course, if all those files are in cache, you're good to go, right? You
know that you've updated the index up to that WAL pointer, and you can just
serve that out of the cache. You have to do the list call. Those list calls
might take a couple of tens of milliseconds, but you have a consistent database,
and it's really simple to run. If your query node goes down, you're good. Now
you might be in a scenario where you're starting to get a lot of queries, right?
The little box you have this running on can't do any more. You've been
optimizing all of this; your B-tree map is pretty good; you're doing all this
async indexing, and you're doing all this read coalescing, and you've done all
the optimization you can figure out with performance. So now you have multiple
nodes. Now you have writes to the WAL from multiple nodes, right? Maybe a query
can come into any of them. So what you do now is you do a list call before you
do the write and like, "Okay, it's currently at four; I'm going to add five."
Then you use a compare-and-swap kind of mechanism to make sure that you can only
put it if it doesn't exist, and you can start to add layers of complexity like
that.
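The strongly consistent read path Simon describes, list the WAL prefix, apply anything the derived index hasn't seen yet, then serve, can be sketched like this. The in-memory `wal` list stands in for the object store, and `list_wal` and `load_entry` are hypothetical stand-ins for LIST and GET calls.

```python
wal = []          # pretend object store: list of (seq, entry) pairs
index = {}        # derived in-memory index (id -> data)
applied_upto = 0  # WAL pointer the index currently reflects

def list_wal():
    """Stand-in for a LIST call on the wal/ prefix."""
    return [seq for seq, _ in wal]

def load_entry(seq):
    """Stand-in for a GET of wal/{seq}.json (cacheable: entries are immutable)."""
    return dict(wal)[seq]

def query(record_id):
    """Read-your-writes: catch the index up to the WAL before serving."""
    global applied_upto
    for seq in list_wal():            # the consistency step before every query
        if seq > applied_upto:
            entry = load_entry(seq)
            if entry["type"] == "insert":
                index[entry["id"]] = entry["data"]
            else:
                index.pop(entry["id"], None)
            applied_upto = seq
    return index.get(record_id)

wal.append((1, {"type": "insert", "id": 7, "data": "hello"}))
print(query(7))  # → hello: the new entry is visible immediately
```

The LIST call costs tens of milliseconds per query, which is the price of strong consistency in this design; relaxing to eventual consistency lets you skip it and catch up asynchronously.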
Kaivalya Abte [25:08]:
Interesting. Yeah, so just to recap a little bit what we have by now, right? We
have written to a WAL that's there in the object storage. We have also built
some sort of indexes like for key lookups, key-value lookups; we have a hash
map. For range queries, we can have a B-tree. For consistent writes, you can
also keep these B-tree indexes in sync by making a list call to object storage
and so on. Let's take a step back. While I'm writing this to the WAL, what are
the things that could be challenging because it's on an object storage? For
example, if I have two writers trying to write a file, and the other one writes
2.json already, and the other one is trying to process, and then as soon as it
has enough data, it also writes to 2.json. How do you handle this kind of
conflict between two writers? Let's start from there and identify some other
challenges we might be facing.
Simon Eskildsen [26:05]:
Yeah, I think there's probably five different things that you could do at this
point. So we can go through them, right? It's another decision point. We made a
decision earlier that we're going to be strongly consistent. Now we're at
another decision point of what do we want to do if we have multiple writers? So
one way that you could solve this is by deciding that you never have multiple
writers; you only ever have a single writer. This rule will have problems
because you kind of have to enforce that somehow.
Kaivalya Abte [26:32]:
Yeah.
Simon Eskildsen [26:32]:
How do you want to enforce that? That's another decision point, right? So we're
not going to walk down that decision tree. But that could be one way. Another
thing that you can do is that you could bring in some kind of lock service,
right? Because you could sort of say, "Okay, while I'm writing to the WAL, I
have to have a lock." So I go to ZooKeeper or I go to etcd or I go to some kind
of consensus layer and ask the consensus layer for permission to write to the
WAL, write to the WAL, release the lock. Now, if you have multiple writers,
that's going to work just fine, right? One comes into node A, writes, and then
node B is waiting until that write, and there's some fairness there. The third
option, I guess, we're at now that you could do is that when you do the write,
the first thing you do is you do a list. The list might come back and say the
WAL has 1, 2, and 3, and then you should be writing 4. So then you write in 4,
but you put a condition on the write operation. All the major object store
providers support this, where you can say that if 4 exists when I write it,
reject the write. It only allows you to add it, right? Then it comes back, and
you can decide what you want to do. Do you want to fail, or do you want to just
try again with a bigger number, right? Of course, this changes kind of the
semantics around what database guarantees you get, because you now can't
guarantee that the order 1, 2 in the WAL is also the order the writes arrived
in. That may
or may not matter to you. The fourth option that you could do, actually, let's
keep it there because I think the other options I can think of are derivatives
of either not caring with a single writer or doing something consensus-based or
then using object storage itself. Every other variant I can think of off the top
of my head are variants of those.
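The third option, conditional writes, can be sketched as a compare-and-swap loop: LIST to find the next sequence number, then PUT with a precondition that the key must not already exist. On real S3 that precondition is `put_object(..., IfNoneMatch="*")`; here an in-memory dict simulates the store so the retry logic is visible.

```python
import json

store = {}  # simulated object store: key -> body

def put_if_absent(key, value):
    """Conditional PUT: succeeds only if the key doesn't exist yet."""
    if key in store:
        return False          # another writer won the race
    store[key] = value
    return True

def append_wal(record, max_retries=10):
    """Append a record at the next free WAL position, retrying on conflict."""
    for _ in range(max_retries):
        seq = len(store) + 1              # "LIST" says the WAL has seq-1 entries
        key = f"wal/{seq}.json"
        if put_if_absent(key, json.dumps(record)):
            return seq                    # committed at this position
        # conflict: someone else took this slot; retry with a bigger number
    raise RuntimeError("too much contention on the WAL")

print(append_wal({"type": "insert", "id": 1}))  # → 1
print(append_wal({"type": "insert", "id": 2}))  # → 2
```

No lock service is needed: the object store itself arbitrates which writer gets each position, at the cost of the ordering caveat Simon mentions.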
Kaivalya Abte [28:29]:
I think conditional writes are now supported, so that is a good way to handle
this conflict. You mentioned something about the ordering of the writes, and in
some cases, that could be extremely important. The traditional way of handling
that is to have a single writer so it can order based on some timestamp or
properties. But let's say we have multiple writers, and then, of course, there
can be different failures. One write can be slow; the other writer can be fast
and so on. In that case, how would you think about ordering guarantees if that
was a constraint?
Simon Eskildsen [29:04]:
Yeah. Oh, there is a fourth option here, which is relevant to what you're
discussing that's very simple, which would be to use some UUID v7 or something
like that, right? Like a semi-ordered UUID where the first 48 bits are a
timestamp and the rest is randomized.
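This fourth option can be sketched as a minimal UUIDv7-style generator: a 48-bit millisecond timestamp followed by random bits, so keys sort roughly by wall-clock time while concurrent writers can't collide. Real UUIDv7 also sets version and variant bits, which are omitted here for brevity.

```python
import os
import time

def uuid7ish():
    """48-bit millisecond Unix timestamp followed by 80 random bits."""
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")
    return (ms << 80) | rand

# Keys generated later sort after earlier ones (to millisecond
# precision), so WAL entries named this way stay roughly time-ordered
# even with multiple writers, and collisions are vanishingly unlikely.
a = uuid7ish()
time.sleep(0.005)
b = uuid7ish()
print(a < b)  # → True
```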
Kaivalya Abte [29:20]:
Yeah.
Simon Eskildsen [29:21]:
You could do that, which has similar properties. What kind of guarantees do you
want around writes? I mean, there's a layer here, and there's extremely
technical vocabulary of how to use this, and I get it mixed up all the time, and
I think it would be very hard to parse in audio format. So we're not going to go
too deep into it, but rather just talk about this at a high level. The first
thing, right, is just how do you resolve the multiple writers, right? Like
there's two writes coming in at the same time. Do you guarantee that if this
one, this request was accepted first, that the write then comes before the other
one, right? You might be writing the same record. I think for some workloads,
this may be very important, right? There are different isolation and guarantees
that you can have around this transactional nature of it. I think that for
something like search or something else, it becomes a little bit less important.
I think that generally what you would try to do in systems like this is to do as
good of a job as you possibly can to make sure that the writes come in in the
order that they arrived, typically through a single writer. The multi-writer
scenario happens more if there's some kind of outage, right? If you want to have
very strict guarantees around the ordering and how things come in, you can do
that even with these models that I propose. But I think it might be getting into
the weeds of the transactional guarantees, the isolation levels, and all of that
in a way that may not be super fruitful. I think it's maybe more interesting to
talk about that, of course, when you return a successful response back to the
client, we know that it's been committed to object storage, which is like a very
strong guarantee. Another type of guarantee that's very easy to reason about is
that everything that's been committed together in our WAL database is going to
be applied at the same time. So you might have a WAL entry that updates 1, 2,
and 3, and all of those things will always be seen at the same time, right?
Kaivalya Abte [31:24]:
Yeah, that's right. As you mentioned, there are different trade-offs for each of
these approaches, and it all depends on the guarantees you need. We don't want
to go into the realms of transactional databases where there's really strong
guarantees, but still be in the search space. Probably taking an example would
be helpful here, for example, from Shopify or something. There are different
aspects of writes as well. For example, I want to talk a little bit about how
frequently do you write data to object storage because there is also cost
attached to that part, right? How do you approach that kind of trade-off between
the number of writes to the object store versus caching or batching them
together?
Simon Eskildsen [32:06]:
For sure. Yeah, I mean, let's phrase it in terms of our very simple database,
right? We'll call it SimpleDB. In SimpleDB that we're talking about here, you
might have a scenario where, yeah, someone is clocking in a couple of writes per
second. Of course, you could do 1.json, 2.json, and so on for all those writes.
But if they don't care a ton about write latency, what you might decide is to
coalesce those writes. In a traditional database, this is called a group commit.
If you think about, for example, something like MySQL, if you're doing thousands
of writes per second, you may have performance issues if you try to sync to disk
for every single write. I think I wrote an article once called, "Is the number
of transactions per second equal to the number of fsyncs per second?" That was
the simple hypothesis going on, right? If an fsync takes one millisecond, which
it often does on a modern SSD. By the way, I collect all these numbers in a
GitHub repository called sirupsen/napkin-math; one of those numbers is the cost
of committing durably to an SSD. So, okay, if
that takes one millisecond, that means that you can do a thousand transactions
per second, right? Well, no, because MySQL is smarter than that. If you're
updating one byte there, two bytes there, and so on, all these different
records, we can coalesce that into however much data the disk can durably commit
at once, right? How much an fsync can commit depends on the disk and the
operating system, more the disk than the operating system, right? But let's say that it can
commit four kilobytes at once. Well, if every single write comes down to 100
bytes, well, now you can commit 40 things at once rather than one, right? So now
instead of committing a thousand times, like one over one millisecond, which is
a thousand fsyncs per second, you can do a thousand times 40, so that's 40,000
transactions or updates per second, right? So that's one thing. We can use that
same concept with object storage, but the numbers look very different, right? A
commit takes hundreds of milliseconds depending on the size of the commit. So we
might decide that every second we only flush a handful of batches to object
storage—whatever lines up with those hundreds-of-milliseconds commits. It comes
down to cost, right? So it's really more of a latency requirement of like,
"Okay, 250 milliseconds or so for writes is acceptable to the users." We're
going to create windows of that, and then you can imagine kind of like, "Okay,
you're in the first 250 milliseconds of a second; we coalesce all those writes."
There might be like hundreds of writes coming in, and you just literally sleep
inside of the node that's accepting the writes, and then you coalesce them
together, commit them, and then you send it back to all the clients, right?
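The group-commit idea Simon describes, buffering writes inside a fixed window and acknowledging them all after one durable commit, can be sketched roughly like this. This is a toy, not turbopuffer's implementation; the class and callback names are made up, and `flush_batch` stands in for one PUT of the whole batch to object storage.

```python
import threading

class WriteCoalescer:
    """Toy group commit: buffer writes for a fixed window, then flush
    them as one batch (e.g. one PUT to object storage) and ack all
    waiting clients together. Illustrative sketch only."""

    def __init__(self, flush_batch, window_s=0.25):
        self.flush_batch = flush_batch  # one commit for the whole batch
        self.window_s = window_s
        self.lock = threading.Lock()
        self.batch = []
        self.batch_done = threading.Event()

    def write(self, record):
        with self.lock:
            if not self.batch:
                # First write of this window: schedule the flush.
                done = self.batch_done = threading.Event()
                threading.Timer(self.window_s, self._flush).start()
            else:
                done = self.batch_done
            self.batch.append(record)
        done.wait()  # client blocks until the whole batch is durable

    def _flush(self):
        with self.lock:
            batch, self.batch = self.batch, []
            done = self.batch_done
        self.flush_batch(batch)  # e.g. one PUT of many coalesced writes
        done.set()               # wake every waiting client at once
```

Hundreds of concurrent writers landing in the same 250 ms window all share a single commit, which is exactly the fsync trade MySQL makes, transplanted to object storage.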
Kaivalya Abte [35:06]:
Yep. Now I think that's an interesting point. I guess there was also a
discussion when we talked about SlateDB with Chris the last time. There was one
interesting point where SlateDB does it based on the time, let's say some
linger-ms configuration, which has, let's say, 100 milliseconds or something,
and then for 100 milliseconds, it will just try and batch and collect it in
memory and then flush it to object storage. But then there was another question
that I asked; it was about the amount of data. What are your thoughts on that?
Waiting for a certain duration, like you mentioned, like sleep for a certain
duration, and that sleep is defined by the time, let's say the wall clock. How
do you compare that with the amount of data, and what kind of workloads would
make sense for the other approach?
Simon Eskildsen [35:51]:
Yeah, that's an excellent question. The way we think about it, the model that
I'm presenting here, by the way, is a heavily simplified version of what
turbopuffer actually does underneath the hood, right? We're just starting from
first principles here and building up, which is my sport.
Kaivalya Abte [36:06]:
Yeah.
Simon Eskildsen [36:07]:
And so time-based versus size-based, it's really a cost question, right? The
fastest thing that you can do is that every single time a write comes in, you
just commit it to object storage, right? Your conflict resolution is not this
incremental simple thing we just talked about; it's something different we could
talk about what that could be. But that would be the absolute fastest way to do
it. The whole reason to coalesce the writes in the first place is to reduce what
we call the put amplification, right? When you're talking about the database,
there are different types of amplification that matter. There's the get
amplification, which is how many S3 GET calls you're doing. A million gets cost
about 40 cents. There's the put amplification, which is the number of operations
where you're writing or doing something write-related, like listing, which is
also expensive, or multi-part start, multi-part commit, and a multi-part part
number upload. Then there's the storage amplification, right? Which is how many
times are you storing an individual byte? There is the compute amplification,
which matters more for building indexes. There is the read amplification;
there's the write amplification. That's been very well studied, right? Any LSM
paper out there is going to just talk about storage amplification, read
amplification, write amplification, the write workloads. I'm sure Pavlo, Andy
Pavlo, has written a beautiful paper putting in a diagram like which compaction
algorithms optimize for what, right? But with object storage-based LSMs and
databases, we have to walk another two dimensions of compromises, which are the
get and the put amplification. When we write to the WAL, that's really what
we're balancing here, right? The lowest-latency thing to do is just to
commit every single write to object storage. If we coalesce them, we reduce the
upper bound of the put amplification to four puts per second with the scheme
that I just mentioned. Or maybe we're okay with even lower put
amplification but even higher write latency, which is what we're directly
trading off here, saying that we're going to commit once a second now. Then
there's a tunable here, right? Which is just tunable between these two extremes
of, "Okay, like my 250-millisecond window has grown to a gigabyte; now I'm going
to upload that." You can saturate the network doing that. If you can assume that
you saturate the network, then it's not really a huge issue. But if you drill
down even more, in order to saturate the network with a single gigabyte chunk,
you have to use multipart uploads with carefully tuned part sizes. So you may
instead decide that you're going to split this into individual chunks and do
individual puts instead of multipart. But now you have individual files, which
increases the read amplification and the get amplification down the line. I'm
not saying this to intimidate you with the complexity, but more so that this is
a set of compromises. I think the most important compromise for an early
database is simplicity. What is simplest to implement? Right? Then you go with
that. Once you start finding from real workloads what the problems are, or maybe
you know so much about the workloads that you can infer it, then you can graph
all these things and then choose the optimum point for flushing at whatever
interval. But a blanket answer doesn't really exist, right? In turbopuffer,
we've settled on a couple of standards that we've, you know, binary search and
bisected our way to by customer requests and seeing the different trade-offs
there and optimizing our cost of goods sold, right?
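The put-amplification arithmetic behind this trade-off is easy to make concrete. The numbers below are illustrative, roughly S3 list prices (about $0.40 per million GETs, as mentioned above, and about $5 per million PUTs); the function name is made up.

```python
# Back-of-envelope cost of WAL flushes to object storage at different
# coalescing windows. Prices are illustrative, S3-like list prices.
PUT_COST = 5.00 / 1_000_000   # dollars per PUT (assumed)
GET_COST = 0.40 / 1_000_000   # dollars per GET (assumed)

def monthly_put_cost(flush_interval_s):
    """Cost of one writer flushing its WAL every flush_interval_s."""
    seconds_per_month = 30 * 24 * 3600
    puts = seconds_per_month / flush_interval_s
    return puts * PUT_COST

# One PUT per write at 1,000 writes/s, versus 250 ms windows (4 PUTs/s):
per_write = 1000 * 30 * 24 * 3600 * PUT_COST  # about $12,960 per month
windowed  = monthly_put_cost(0.25)            # about $52 per month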
Kaivalya Abte [39:45]:
Yeah, I think SlateDB also uses the simple approach, which is just wait for some
time and then try to upload. Probably when we build this database, we have more
use cases; we can also give this choice to the user that, you know, you decide
because the cost is directly impacting how you are charged. So it could either
be a hybrid approach or just, you know, one of these two. And that makes sense;
it's all about trade-offs. Talking about writes even further, I want to take a
step back and, you know, talk about how these writers are even coming and
becoming part of our, you know, let's say the cluster. Since we were talking
about conflict resolution, that one writer is trying to write this file, and
let's say 2.json, and the other writer is also in progress. Then you have to use
the list operation, which is even costlier. Can we not do something like when
the writer comes up, they register themselves in object storage and probably get
a writer ID or something? Instead of using 2.json, we could just suffix it with
the writer ID as well, so we never have this kind of conflict. What are your
thoughts on that? Have you seen this approach? Have you evaluated something like
that?
Simon Eskildsen [40:57]:
Yeah, so you want to put the ID in there, and then you can accept the write from
any node. Yeah, 100%. I mean, to me, this is just a variant of doing the UUID
v7, right? It gets a little bit more difficult to reason about what the ordering
of operations is, but you can definitely do that, right? I think the isolation
guarantees you get are not as high as with the other model. But again, this is
like a different set of trade-offs. I think that would absolutely work. I think
if you're doing that, you might as well just do a UUID v7 or something even
simpler so you don't have to keep track of or like, yeah, it's not a meaningful
number, right?
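The appeal of UUIDv7-style keys here is that they are time-ordered, so WAL objects written by any node still sort roughly by arrival time under a shared prefix. A minimal hand-rolled key in that spirit (not the exact UUIDv7 bit layout; the function name and prefix are made up):

```python
import os, time

def wal_key(prefix="wal/"):
    """UUIDv7-flavoured object key: 48-bit millisecond timestamp plus
    a random tail. Keys from any writer sort lexicographically by
    wall-clock time, so listing the prefix yields an approximate
    arrival order with no writer registration needed. Sketch only."""
    ts_ms = int(time.time() * 1000)
    return f"{prefix}{ts_ms:012x}-{os.urandom(8).hex()}"
```

The fixed-width hex timestamp is what makes plain lexicographic ordering line up with time; the random tail prevents two writers colliding in the same millisecond.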
Kaivalya Abte [41:37]:
Absolutely. I guess, I mean, I think we should talk a little bit more on, you
know, the read path and how do we make it performant and the caching and
tunable, you know, caching and all of that. Probably then we can talk about, you
know, what happens when the writers go down. So doing a little recap, we have
our WAL in our object storage; we have also built some sort of indexes like hash
indexes, binary trees, and, you know, in memory. We have some tunable properties
on how we, you know, flush data to object storage. We have some, you know,
trade-offs there to understand depending on the workload. Now, let's say our
range or hash-based queries are fast enough. What's next? How do we make the
reads even better and more cost-efficient? Because I guess it all depends on the
amount of size that it could cache on each node, right?
Simon Eskildsen [42:25]:
Yeah, I think, I mean, the trouble with this architecture, right, is that we
talked at the top about the cost of doing various things, right? The memory that
you get in the cloud usually costs you about $2 per gigabyte. The architecture
we're talking about here works great, right? But it's assuming that we
essentially just hold all the data in memory all the time. So we're just using
object storage for persistent storage, and then really, you know, if you think
about it, a database is just kind of like a cache or derived indexes over the
write-ahead log, right? In our case, we've just chosen to put everything in
memory, and that's fine. But we have a couple of problems here on the read path.
The first problem is that a cold query is going to be incredibly slow, right? It
has to saturate the network reading the write-ahead log, which could be millions
of entries. You have to download everything; then you have to replay the entire
log. Especially if it's JSON, you're going to be spending a lot of time
deserializing and doing a lot of memory copies to get to the end to create these
derived indexes. You might think, "Well, how often does that happen?" Well,
every time the node goes down or you have to upgrade a version or you deploy,
this is going to happen unless you do something even more complicated to keep
track of that memory. So this is our biggest problem right now: cold queries are
slow, and the query nodes are very fragile. As soon as a query node goes down,
you're going to just be down until the entire write-ahead log has been replayed.
Kaivalya Abte [44:02]:
Yeah.
Simon Eskildsen [44:03]:
What we have to do now is we have to take these derived indexes and start
persisting them to object storage. So what we're going to do is that when we
build that B-tree as of WAL entry 4.json or wherever we are, we also take it
and put it into object storage. So we now have B-tree.json. Let's say we're
serializing it to JSON, and we put that there. At this point, you might be using
some kind of binary serialization, right, of whatever programming language
you're using to just serialize the struct. But let's just say that it's JSON for
simplicity. Inside of B-tree.json, you're also going to have a pointer to say
this is an up-to-date B-tree as of WAL entry 4.json. So now when you have a cold
query, what we can do is we can just download the B-tree map from object storage
and put that into memory, and then replay the log from 4.json onwards so we're
still strongly consistent, right? At some point, we'll flush a new B-tree
map up to object storage. This is better, right? Now on a cold query, we have to
download just this derived data. But you're also going to run into trouble with
that eventually because that is an enormous B-tree map. If you have a database
with millions or tens or hundreds of millions of entries, you have to download
the entire B-tree map and then start searching.
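The checkpoint-plus-replay idea can be sketched in a few lines: persist the derived index together with the WAL position it reflects, and on cold start load the checkpoint and re-apply only the tail of the log. All names here are made up; a dict and a list stand in for object storage and the WAL.

```python
import json

def save_checkpoint(store, index, applied_upto):
    """Persist the derived index plus the last WAL entry it covers."""
    store["btree.json"] = json.dumps(
        {"applied_upto": applied_upto, "index": index})

def cold_start(store, wal):
    """Load the checkpoint, then replay only WAL entries after it,
    so the node is strongly consistent without a full-log replay."""
    ckpt = json.loads(store["btree.json"])
    index = ckpt["index"]
    for seq in range(ckpt["applied_upto"] + 1, len(wal)):
        index.update(wal[seq])  # re-apply just the tail of the log
    return index
```

The pointer inside the checkpoint is what keeps this safe: a stale checkpoint is never wrong, it just means replaying a slightly longer tail.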
Kaivalya Abte [45:25]:
Yeah.
Simon Eskildsen [45:25]:
Now you're starting to get into the more complicated realm of, "Well, I actually
want to walk the B-tree map on object storage, right? So I want to go there, and
then I want to get the root node of the B-tree with one round trip, right? With
a range read because I know where in the B-tree the root node is. Then I get the
children from the root node, and then I do the actual tree search directly by
negotiating with object storage." The problem with that is that the object
storage P90 is somewhere around 250 milliseconds. It depends a little bit on the
size of how much data you're requesting, but generally, you can get it to this
if you're willing to get a bit of complexity and some hedge requests and stuff.
This is problematic because, you know, you have to search log n layers of this
B-tree. So for a small one, right, what is it for a million entries? Log base 2
of a million is about 20, so 20 round trips. That's 20 times 250 milliseconds,
or let's say the P50 is 50 milliseconds, so 20 times 50. Now we can do this in about a second
to walk the B-tree. It's still better than downloading the entire thing and
putting it into memory. You try to do a lot in one round trip and minimize the
number of round trips. So with our B-tree now, we say, "Okay, we're going to
have more children per node," and this is essentially what a B+ tree and some of
these optimized derivatives do: they make each layer fatter. So we're going to
do that. We might also try to cache some of
the upper layers if we have a lot of data, but now we're down to maybe doing
three or four round trips, right? Then we can get a lot better performance into
hundreds of milliseconds even on a cold query. Now, of course, after the cold
query, it's in memory, and then we only have to walk the WAL and do this
consistent pointer thing, but we're in a pretty good state now.
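The fanout-versus-round-trips arithmetic above is worth writing down. A small helper (names and the 50 ms P50 latency are assumptions for illustration):

```python
import math

def cold_lookup_ms(entries, fanout, rtt_ms=50):
    """Levels to walk a tree of `entries` keys at the given fanout,
    times an assumed per-request latency to object storage."""
    levels = max(1, math.ceil(math.log(entries, fanout)))
    return levels * rtt_ms

binary = cold_lookup_ms(1_000_000, 2)    # ~20 levels -> ~1000 ms
fat    = cold_lookup_ms(1_000_000, 128)  # ~3 levels  -> ~150 ms
```

Fattening each node from 2 children to 128 collapses a million-entry tree from about 20 round trips to 3, which is the whole reason B+ tree-style layouts win on object storage.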
Kaivalya Abte [47:23]:
So instead of rebuilding the entire B-tree, you're persisting that in object
storage, and whenever something goes wrong or you need to scale up, you can just
read the B-tree and start from there. So maybe the writes are not that
high, so you don't have to walk over the entire log but a part of it, and that's
where we are saving. But then all the nodes need to do this because who exactly
is building the B-tree in this architecture? That's still unclear if I
understand correctly. We have some nodes that are writer nodes that write the
data, and then whenever there is a query, some node is building the B-tree, or
who is responsible for building this B-tree?
Simon Eskildsen [48:07]:
Yeah.
Kaivalya Abte [48:07]:
Is there something working on the object storage?
Simon Eskildsen [48:10]:
We haven't dealt with any problems yet that have made us deviate from just
having one binary on one node. So it's really one node doing all of these
things. We can split it, but we haven't from first principles yet encountered
any issue that's causing us to want to do that. So yeah, we could split it.
There are reasons to split ingestion, split reads, split all of these different
things, but we haven't encountered anything yet that requires us to do that,
right? If we want to optimize for reliability, well then, yeah, we need to have
multiple nodes, and we need to do—we need to make sure that we route the table
to where it's most likely to be in cache. But those are not really optimizations
we've done yet. So for now, let's just assume that it's the same node doing
everything. But there are problems with this if you're adding things to the
B-tree like a thousand times a second, like we were talking about before. Well,
are you also writing the B-tree to object storage
every thousand times a second? Every single time there's a new entry? Probably
not. You probably put some heuristic to say, "I'm only going to write every time
the WAL has grown by like n elements," because otherwise, you have too much put
amplification and too much write amplification, because every time you write one
byte, you are rewriting the entire database to object storage.
Now, this is where you start to get into building an LSM, right? Because if this
is a problematic trade-off for you, you can't just keep writing the entire
B-tree into one object on object storage. You have to start splitting it. And
that's why you build an LSM, right? At the end of the day, the idea behind the
LSM is just instead of writing the B-tree map, streaming out to one file and
keeping everything there, we flush batches of updates and then merge them
together at read time. So I might say, "Okay, the WAL has grown by 1,024
elements; I'm going to build a new B-tree for just that,
and I'm going to put that into object storage." When I do a query, I have to
read both of those files, right? So that's how we start to see the pattern. This
is not really an LSM yet; it needs some more elements, but it's that idea. And
then just all the other compromises and problems we encounter later will turn
that into a proper LSM. Because, of course, if you're doing thousands of writes
per second, that means we're doing one more file a second. Every time we do a
read, a cold read, now we have to read potentially thousands of B-trees and do
the queries on those. So we have to merge them periodically, and that's
compaction in the context of fixing a fundamental flaw.
Kaivalya Abte [50:50]:
Okay, interesting. Yeah, so you mentioned reliability. We haven't really faced
any problems when you just have one node doing everything—handling reads,
handling writes, and probably also doing some compaction. Let's dive a little
deeper into what advantages we get by separating out readers, writers, and
compactors because some databases have this separation. Is it more control? Is
it just differentiating the workloads? So probably writes are more important
than reads, and we can scale writes differently. What are the different
trade-offs here when we talk about separating these components into different
nodes versus just having one node doing everything?
Simon Eskildsen [51:33]:
Yeah. So let's walk through every workload first. So for the compaction, right,
which is a fairly resource-intensive activity: perhaps you're compacting the
WAL; perhaps you're compacting these B-trees, or it could be an approximate
nearest neighbor index, or other types of derived indexes. In the case of an
LSM, it's typically an SSTable, but it doesn't matter too much; it's basically
just a sorted list of things. The compactor node is nice to separate out
because these LSM-based approaches fit object storage so naturally, given the
append-only nature of the data. We like to separate that out so that the
compactors can once in a while download the data, compact it, and then change
the pointers so that we use the new data. So that's a really good workload to
separate out. In the context of
something like turbopuffer, we call this indexers. Yes, they compact, but the
primary thing that they're doing is building these approximate nearest neighbor
indexes, which makes compaction look like a very easy job in terms of resource
consumption. So the compactor might be doing both: it's compacting data, and
it's also indexing it. Indexing into a B-tree is so little computation that you
might as well do it on the query node because it's just an n log n operation.
But for an approximate nearest neighbor index, the computation is much more
expensive. It's the same with an inverted index for full-text search.
Kaivalya Abte [53:07]:
Mm-hmm.
Simon Eskildsen [53:07]:
You can choose that depending on those compromises. Then you have the write
workload. At the end of the day, the write workload, if you take the compaction
out of it, is really just a matter of the deserialization of the data coming in
from the user, serializing it into a WAL entry, and then adding it to the WAL.
It really is just a type of data proxying and looks a little bit more like maybe
a streaming, very, very simple streaming workflow than anything else. Of course,
in some databases, you can even go so far as making that like essentially zero
copy by having the client write in the same format as you persist into the
database after some consistency checks. So writes are computationally very easy;
they can be network intensive. So you may want to separate them out at some
point if you start saturating the network a lot on writes, which may impact your
reads, or prioritize reads over writes in a networking abstraction that you
build yourself, or rely on the kernel, whatever you want to do. We don't
separate writes and queries; it would be very easy for us to do it later, which
is part of why we don't do it now. Another reason to not separate writes is that
when you write, say, 4.json to the WAL, you could also write it to the local
cache. If
you do that, then it's already in the cache, and you don't have to go to object
storage, which means that a read that's coming in immediately after is already
hot. So that would be another reason to not separate it, and that's a very cheap
operation to do when you already have the data. You don't have to—there's no get
amplification. Then there's the query itself, right, which we talked about.
There may be advantages or disadvantages to separating it from the write path.
This is the highest-priority task because it's very latency-sensitive in
this type of system. In a traditional OLTP database, the database doesn't really
have any sense of what's more important to write or read in terms of latency.
But as a nature of our system, we know that write latency can't be the most
paramount property of the database because then you wouldn't have chosen this
type of database in the first place. So the queries you would often run on the
same nodes, right? A third or fourth type of workload, other than compaction,
writing, and reading, would be some periodic activities,
right? Typically, this would be done on the compactor. This might be things like
billing or like GCing objects and things like that. But that could be another
workload. Typically, you do it on the compactors, right?
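The write-through-cache argument for keeping writes and queries on the same node can be pictured like this. A sketch with made-up names; `object_store` is any dict-like blob store standing in for S3.

```python
class WriteThroughNode:
    """Toy combined write/query node: every WAL object committed to
    object storage is also dropped into the local cache, so a read
    arriving right after the write is already hot. Sketch only."""

    def __init__(self, object_store):
        self.object_store = object_store
        self.cache = {}  # stand-in for a local memory/NVMe cache
        self.gets = 0    # count of paid trips to object storage

    def commit_wal(self, key, blob):
        self.object_store[key] = blob  # durable commit (the PUT)
        self.cache[key] = blob         # write-through: free cache fill

    def read(self, key):
        if key not in self.cache:      # miss -> one paid GET
            self.gets += 1
            self.cache[key] = self.object_store[key]
        return self.cache[key]
```

Because the node already holds the bytes at commit time, the cache fill costs nothing and adds zero get amplification, which is the advantage a split write tier would give up.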
Kaivalya Abte [55:33]:
Yeah, I think that makes sense. I guess we have to look at what kind of
computation is required for writing, what kind of computation is required for
reading, and compacting, right? That's when you'll be able to make a better
decision and provide resources to these nodes. If it's, as you mentioned, data
proxying, you probably don't need that compute-intensive resource; it's more
like a network-intensive operation. But for compaction, it might also be network
and compute-intensive; that depends. And for readers, I guess it's more like a
memory-intensive operation because it has to cache data. You mentioned making
reads and writes on the same node because then you already have this cached data
because probably it's just written. But imagining multiple nodes doing this, are
we talking about another node that keeps the mapping of who wrote this data or
where are we putting this mapping where it knows this is the node that just
wrote this data, and probably it should also be served from this?
Simon Eskildsen [56:36]:
Yeah, I think the simplest solution to this that I can think of is to do
consistent hashing so that when you come into the load balancer, it looks at the
key that you consistently hash on. It might be something like the tenant ID and
then the table name, right? You hash that to the node. So let's say there's
three nodes; we always go to node A for Kaivalya's data, and we always go to
node C for Simon's data. So that would be how we do it. Of course, at some
point, you're going to run into problems with simple consistent hashing because
maybe your table is 10 billion elements and mine is 10, so again, if we look at
it from the same lens as we did when we thought about how to separate
compaction, GCing, writing, and reading, there's kind of a bar chart of the
resource constraints that something utilizes: CPU, memory, NVMe, and network,
right? Inside of those, you also have things like all the implications that we
talked about before—IO, storage, whatever, right? So at some point, the load
balancer may need to know properties about the system to know whether it should
spread across multiple nodes, whether you should have your own node. There are
various levels of sophistication you can do there. Multi-tenancy, in general, is
a very difficult problem, right? That's what I spend a lot of my time at Shopify
doing, just protecting tenants from each other and making sure that they get a
sizable slice of the pie. You're not going to be happy if some major YouTuber
dropping a new product means that your store is down. With turbopuffer, you're
probably going to be even more mad if, you know, Cursor, one of the big
workloads, causes any problems for you; that's not a good database experience.
So this
multi-tenancy issue is, I think, an underrated problem for people who work on
databases. It is supremely difficult to guarantee fairness across all these
sub-dimensions, like fair scheduling. Of course, the Linux kernel scheduler does
a pretty good job of balancing things between tasks, but it doesn't know or
care that, say, Kaivalya's queries have consumed 99% of all query time over the
past 24 hours, and all of Simon's queries are slow as a result.
That's not a fair database experience, right? So you kind of have to go in front
of all of these resources that I mentioned and write your own kernel-like
scheduling layer in user space so that you know how much you can consume, what's fair,
and keep track of kind of the leaderboard of resource constraints and make sure
that this works out. At the same time too, because a multi-tenancy system has
almost unlimited resources, you may have gone away for the weekend, and for some
reason, your thing is doing 20,000 queries per second, and that's not what you
wanted to be billed for. So there are lots of nuances here in how to run a
successful multi-tenancy system, which could be a whole discussion in itself,
and we can go into more depth on that if you like.
Kaivalya Abte [59:45]:
I think that's a pretty difficult problem, and you know many databases do it.
They kind of do some calculations based on heuristics of what kind of workloads
a typical tenant has, and you know, probably pre-scale some computation if they
know this is the peak hours for these kind of tenants, so they have more
resources in place. It also depends on how they are charging the tenants. If
they are charging based on computation and they don't want to stop them,
probably they'll just increase the computation resources and then charge them
accordingly so it doesn't impact the other tenants, but they are also able to
leverage this increase in computation requirements by bumping up the computation
resources. I think that's a very important problem, as you highlighted. There's
no one solution for that; of course, we are going to start simple with
consistent hashing. Eventually, when we have some problems, we can do more. One
thing that we talked about, and you know, we also laughed a little bit about
that, is storing our files in JSON format, because it's not optimal,
right? What are the concerns or trade-offs there, and what are some other
storage formats available that you have explored while building turbopuffer?
Simon Eskildsen [1:00:58]:
Yeah, I think we, you know, JSON is always a great place to start when you're
exploring it.
Kaivalya Abte [1:01:04]:
Yeah.
Simon Eskildsen [1:01:04]:
We still use it for some of the metadata because it's very easy to inspect. It's
very simple to understand. It's very simple to map to any language, and it's
human-readable. So I wouldn't underestimate it. But the problem with JSON is
that it's very difficult to deserialize more than a couple hundred megabytes per
second. So you have to go through this whole deserialization process, so we can
find formats that have more efficient deserialization, right? Those exist, or
something like Protobufs, for example, which is much faster to deserialize
because there are just fewer constraints and fewer conditions that we have to check.
Kaivalya Abte [1:01:39]:
Yeah.
Simon Eskildsen [1:01:39]:
So that could be like kind of a v2; we could do Protobuf. There are a bunch of
open formats as well, right? So there's formats like Parquet, which is of course
one of the most known ones for columnar data. The problem with Parquet is that
it doesn't support random reads, so you can't be like, "I know that ID 15 is in
here; I just want to load that record." Parquet can't really do that because you
have to download the whole file.
Kaivalya Abte [1:02:04]:
Yep.
Simon Eskildsen [1:02:04]:
There are alternative formats that provide these kinds of things, right? So you
could go with one of these open formats. The other option that you can do, and
some of the open formats support this kind of thing, is that you can do kind of
zero copy into the data, right? So what zero copy means is that let's say you
download a file; it's one gigabyte big; it might be our B-tree file, except now
it's not JSON; it's a zero-copy binary blob. You
download it onto the machine, and you just get it in memory, and now your
machine is ready to use it. It doesn't have to interpret it or anything like
that; it just slams it straight into memory after checking that it's not some
malicious thing, and then you can start searching on it. That's the kind of
format that you have to do if you are serious about building a database where
you're using disk and especially data over the network, because you don't want to
spend all this time copying data. One thing is the time aspect, which might not
even be that long, but the CPU consumption means that the requests per second
you can do are quite low. What we use for turbopuffer is the Rust library rkyv, which
is very flexible in the types that we can express and how we can lay out
archives. We don't use any of the open formats because owning this format
entirely ourselves just allows us to move faster. We didn't find a format that
could do everything that we wanted, so we decided for the time being to do our
own format for the LSM that powers turbopuffer.
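turbopuffer does this with the Rust crate rkyv; as a language-neutral sketch of the zero-copy idea, here is a fixed-layout record file searched through a memoryview, so a lookup indexes straight into the downloaded bytes instead of deserializing the whole blob first. The record layout and function names are entirely made up for illustration.

```python
import struct

# Hypothetical fixed layout: each record is 16 bytes, a u64 id and a
# u64 value, little-endian, sorted by id. "Zero copy" here means we
# binary-search the raw buffer directly, never parsing it into objects.
RECORD = struct.Struct("<QQ")

def encode(records):
    """Serialize (id, value) pairs into the sorted fixed layout."""
    return b"".join(RECORD.pack(k, v) for k, v in sorted(records))

def lookup(buf, want_id):
    """Binary search straight over the bytes; no up-front decode."""
    view = memoryview(buf)
    lo, hi = 0, len(buf) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        k, v = RECORD.unpack_from(view, mid * RECORD.size)
        if k == want_id:
            return v
        if k < want_id:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Compare this with JSON: there, reading one record means deserializing everything before it; here, the CPU cost of a lookup is a handful of 16-byte reads regardless of file size.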
Kaivalya Abte [1:03:36]:
Very interesting. But did you explore the Arrow and Parquet formats as well? So
you mentioned that these open formats didn't work well for you. Is it just the
flexibility of the current format, or is there something else that you also, you
know, were concerned about?
Simon Eskildsen [1:03:52]:
Yeah, it really just comes down to the flexibility, right? I know exactly, and
the team knows exactly, how we want this data laid out on disk and the kinds of
data structures that we want implemented. Off the bat, we found things that we
didn't want to do, or wanted to do differently, in Arrow and Parquet, and
combining them looked awkward. So we made a decision early on to just do those
things ourselves. I think down the line that may
change. Maybe we'll open source the format of turbopuffer. Maybe we will allow
you to replicate from turbopuffer into an open format. Like we really have
nothing against the open format. It is just a question of how can we go fastest
with the most flexibility? And that's why we chose rkyv. With rkyv, you can
implement a completely custom data structure on top of it and get away with it.
We've utilized that quite a bit for some performance-sensitive features and so
on.
Kaivalya Abte [1:04:45]:
Okay. Let's take an example. So you mentioned you can implement a data structure
on top of it. Talking specifically about vector databases, there are specific
indexes required. Let's take an example and explain how this flexibility helps
you in turbopuffer.
Simon Eskildsen [1:05:00]:
Yeah, I think the flexibility is not so much about the ANN index (the
approximate nearest neighbor, or vector, index). It's more in how we design the
LSM and how we navigate all the compromises that we've talked about, right?
Kaivalya Abte [1:05:14]:
Okay.
Simon Eskildsen [1:05:14]:
We can completely control how things are laid out on disk, how many round trips
we can do, and all those types of things. I think some of those formats have not
been designed for minimizing round trips to object storage. We have to do a lot
of bit-twiddling and things like that to make sure we do as few as possible. We
just wanted ultimate control on that front. We can certainly talk a little bit
about how to do an ANN index on top of an LSM or on top of this rkyv-based
format if you want to get into that.
Kaivalya Abte [1:05:44]:
Absolutely. I mean, it deserves a separate episode because there's so much to
talk about. For now, I think we have covered the different aspects, challenges,
and trade-offs of building a database on object storage, and how to think about
building something like that from first principles.
Simon Eskildsen [1:06:05]:
Can I just add one thing that I think we didn't get into, but that I think is
important? A lot of people listening to this episode might be wondering why so
many object storage databases are appearing now, right? Object storage has been
around for a long time. So why is it happening now? There are three events that
I keep on my mental calendar and think a lot about, and I think it's important
for your listeners to understand them because they put in context why this is
such a big thing right now. The first one is that NVMe SSDs did not become
generally available on AWS until June of 2018. That's still relatively recent.
Kaivalya Abte [1:06:46]:
Yeah.
Simon Eskildsen [1:06:46]:
GCP had it in 2015. The second event, which I think most people also don't
know, is that S3 did not offer strong read-after-write consistency until
December of 2020. That is remarkably recent, right? That's basically 2021. It's
only like three years ago. Then I think the third thing is that S3 didn't have
some of the primitives we talked about for consensus, like conditional writes
(e.g. put-if-absent), until relatively recently, and those help you build these
systems, right?
So those are the events that have transpired to make this architecture
possible; it almost seems obvious now, right? Every innovation is like a new
lily pad appearing, and people hop further and further out, but I think those
are very, very important to keep in mind.
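To make the conditional-write primitive concrete, here is an in-memory model of the put-if-absent semantics Simon mentions (on real S3 this is a PUT with an `If-None-Match: *` header, which fails if the key already exists; the store and key names below are made up for illustration).

```rust
use std::collections::HashMap;

// Toy stand-in for an object store that supports put-if-absent. On S3, a
// losing writer would see HTTP 412 Precondition Failed instead of `false`.
struct FakeObjectStore {
    objects: HashMap<String, Vec<u8>>,
}

impl FakeObjectStore {
    fn new() -> Self {
        Self { objects: HashMap::new() }
    }

    // Returns true if we created the object, false if someone beat us to it.
    fn put_if_absent(&mut self, key: &str, body: &[u8]) -> bool {
        if self.objects.contains_key(key) {
            return false;
        }
        self.objects.insert(key.to_string(), body.to_vec());
        true
    }
}

fn main() {
    let mut store = FakeObjectStore::new();
    // Two nodes race to commit log entry 42; exactly one can win. That
    // "exactly one winner" guarantee is the building block for consensus
    // on top of object storage.
    let a = store.put_if_absent("wal/000042", b"node-a");
    let b = store.put_if_absent("wal/000042", b"node-b");
    assert!(a && !b);
    println!("node-a won");
}
```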
Kaivalya Abte [1:07:33]:
In the interest of time, we'll save the discussion areas we have left for
future episodes and go deeper into vector databases like turbopuffer there. I
enjoyed the discussion and learned quite a lot about how you think layer by
layer and reason about each of these trade-offs from first principles. So
thanks a lot for joining me today and shedding light on how one can approach
building a database on top of object storage. It was a great learning
experience, and I want to plan further episodes on the areas we left out of
this one. Thanks for joining; it was really amazing.
Simon Eskildsen [1:08:16]:
Thank you so much for having me on. There is nothing that I enjoy more than
building something up from first principles. So this was a treat.