Kaivalya Abte [0:00]:
Hello everyone, welcome to the Geek Narrator podcast. I am your host Kaivalya
Abte. In this episode, we will talk about how to approach building a database on
top of object storage, what the key challenges are, and how we can learn from
Simon's experience building turbopuffer. So welcome, Simon. I am really excited
to have you on, and let's start with a little bit of introduction about
yourself, your background, and how you ended up building turbopuffer.
Simon Eskildsen [0:27]:
Thank you so much for having me on, Kaivalya. I'm really excited to be here.
Yeah, we can start with a little bit of background. I grew up a couple of hours
north of you in Denmark, and I left to work at Shopify after high school,
thinking I would do a little gap year in Canada working for what was at the time
a small successful Rails shop. And then I got super interested in the
infrastructure that powered it, specifically the databases. When you're working
on a SaaS app of that scale, typically the thing that breaks the most is
whatever has the most state, and what has the most state is the database
generally. So I spent almost a decade working there from 2013 until 2021 on
mainly the data layer of Shopify to make sure that the product ambition of
Shopify was not limited by the infrastructure, and even more so the customer
ambition because we had some of the big West Coast celebrities that would sell
an enormous amount of product—Kylie Jenner and Kanye, and they were all somehow
in the Kardashian family or affiliated. I joined when we were doing a little
less than a thousand RPS, and when I left, it was doing more than a million at
scale. As part of this, you work on almost every single part of what powers the
data layer underneath Rails, but just above the database. Of course, you need
extensive knowledge of how the database works. When you're scaling that fast,
you don't have time to write the perfect database for your workload; you have to
do the best with what you have, and that's what I think we did. I worked on
moving Shopify into multiple data centers, sharding MySQL, sharding everything,
running out of multiple regions, failing over inside of a Slack bot, writing the
storefront to be better for caching, and then multi-tenancy and protection was a
big part of it. I left in 2021 and spent some time with my friends' companies
helping them with various scalability challenges, kind of hopping around and
helping them in quarterly increments. These days, the main database challenge
seems to be tuning autovacuum in Postgres. This was the common denominator
across almost everyone that I worked with. Of course, working with MySQL for a
decade, this is not really the issue. There are other issues, but that is not
one because the way that MVCC works is different. One of the companies that I
worked with builds an application called Readwise. What it allows you to do is
take the articles that you read online and put them into a queue so you can read
them and things like that. One of the things we wanted to do was article
recommendations. They asked me, "Can you spike a little project to do article
recommendations?" I said, "Yeah, sure." We did some experiments using
embeddings. So embeddings, right: you take some data, whether unstructured,
semi-structured, or completely structured, and you turn it into a vector. For
anyone in your audience who isn't familiar with what this is, here's a quick
explanation. You might use something like Spotify; this
is literally how Spotify works. You take a song and feed it through the model;
the model will spit out a coordinate in a coordinate system. You can imagine it
in two dimensions, right, like X and Y. All the rock songs are over here, the
country songs over here, and in between you have the artists that span both. In
order to do recommendations, for example, it might be your catalog. You can
imagine that you will color a bunch of these coordinates that you've listened
to, and then we can find recommendations by finding adjacent songs to those and
doing a little bit of geometry to figure this out. That was what we were going
to do for these articles. We were going to take the articles and chunk them up
into pieces and turn that into vectors, and then we could look and say, "Well,
these are chunks that you've read in the past." Maybe you've read about how to
keep a baby alive if you're a new parent. Maybe you're reading about object
storage, whatever. Those land in very different parts of the vector space. Then we could
find similar chunks. We did some experiments with this, and it was promising. It
would take some more tuning. But at this point, I sort of ran the
back-of-the-envelope math on taking all this data that was inside the Postgres
database, chunking it up, putting it into a vector database, and then doing the
recommendations. This company was spending around $3,000 a month on their
Postgres database. The back-of-the-envelope figure for doing this with a vector
database, the ones that were available at the time, was $30,000 a month. You
can't eat $30,000 a month, even if this was worth it from a product perspective.
The fact that it's 10x what you're doing with your relational data—the database
where if it goes down, your entire company is down—you can't do that. This is
just like an unfeasible thing. So we kind of put this in the bucket of, "This is
going to get cheaper," right? Like everything else, the context window is going
to grow, the tokens are going to get cheaper; surely someone's going to figure
out how to store the coordinate system cheaper, right? But I couldn't stop
thinking about it. The outages from running Elasticsearch at Shopify, which is
operationally, I think, one of the most challenging databases that I've ever
operated, were flashing before me. To me, it's always been very obvious
that for a search workload, you have to separate compute and storage, and no one
had really done it before. If you kind of imagine a spectrum of OLTP to OLAP,
you can imagine that for OLTP, you need very low latency, low latency writes.
Everything happens like sub-millisecond if it's a point lookup, but at least
single-digit milliseconds in the worst case, right? If you're writing a big
Rails application, you're depending on all of these point lookups being hundreds
of microseconds because you're doing them sequentially. Now, on the other hand,
you have the OLAP solutions, right, that have separated compute and storage now
for a long time where query latency in the seconds is completely fine, right?
You're doing a lot of writes, and you're doing a query once in a blue moon to
generate a report. Search is somewhere in between. You're actually okay with
maybe 30 to 50 milliseconds of latency; you're processing more data, but your
write-to-read ratio is enormous. It's not uncommon that you're writing like 100
times more than people are searching for stuff, right? The ideal architecture to
me just looked a little bit like a hybrid between the two. You need really good
cache locality, but you also want effectively unbounded compute. The other
thing you notice, and anyone working on search will talk about this problem, is
that to do good search, you've got to do a lot of experiments. For some
experiments you can tune some parameters and
see that your search results are improving, but for other experiments, you have
to rewrite all the data to add some more data. Maybe you want to weigh in the
popularity of the content, how much it's been clicked, how recent it is, and you
didn't do that the first time around, so you're rewriting this data a lot.
Often, you're provisioning for your steady-state write load. Now, if you have to
readjust everything, it's often very computationally expensive. To me, this
object storage-first architecture just seemed really like the right fit. Back to
Readwise, I just had a moment of like, "Okay, you know what? Screw it. I'm just
going to sit down and try to write this thing." That's what I did last summer,
and then we released it in September or October of last year.
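The coordinate-system picture Simon sketches can be made concrete with a toy example in Python. All the song names and 2D coordinates below are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

# Toy 2D "embeddings": each song is a coordinate, with similar genres
# clustered near each other (all values here are made up).
songs = {
    "rock_song_a":    (0.9, 0.1),
    "rock_song_b":    (0.8, 0.2),
    "country_song_a": (0.1, 0.9),
    "crossover_song": (0.5, 0.5),
}

def distance(a, b):
    """Euclidean distance between two coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(listened_to, catalog, k=2):
    """Return the k nearest songs the user hasn't listened to."""
    candidates = [s for s in catalog if s not in listened_to]
    # Score each candidate by its distance to the closest listened song.
    scored = sorted(
        candidates,
        key=lambda s: min(distance(catalog[s], catalog[h]) for h in listened_to),
    )
    return scored[:k]

print(recommend({"rock_song_a"}, songs))  # → ['rock_song_b', 'crossover_song']
```

The "little bit of geometry" is just a nearest-neighbor search; a real vector database replaces the brute-force sort with an approximate index.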
Kaivalya Abte [8:02]:
Amazing. That's a very interesting story, right? You talked about two very
important points: how the monthly expenditure on Postgres went from $3,000 to
$30,000. That's quite a lot. The second one was about how to keep a baby alive.
That's so important, right? I am a new parent; it's really a tough problem. But
anyways, we are here to talk about databases. Thanks for sharing the background
of how you ended up building turbopuffer; it's important for people to connect
more. I want to understand first, is the problem only the amount of data that
you're storing, or along with it, do the computation requirements also grow?
Specifically for the example of going from $3,000 to $30,000, what was the exact
reason for this increase? Is it just the storage requirements because of the
number of vectors you're storing, or is it also computation attached to it?
Simon Eskildsen [8:59]:
Yeah. First, I just want to clarify something. The $30,000 a month was to do the
vector workload of a subset of the data in the Postgres database that cost
$3,000 a month.
Kaivalya Abte [9:09]:
Okay.
Simon Eskildsen [9:09]:
It was not the cost of adding vector indexes into Postgres. That may or may not
be lower; unclear. What you're asking about specifically, though, was around why
this architecture is good for certain workloads.
Kaivalya Abte [9:26]:
Yeah.
Simon Eskildsen [9:37]:
So a write to object storage is generally around 100 milliseconds, a little bit
higher if you're doing multi-part for uploading a bunch of data. You're going to
have to do multi-part if you want to saturate the network, but ballpark in the
hundreds of milliseconds. If you are a transactional database and you're doing
lots of writes and doing transactions, this is too high, right? It doesn't
really work. But for a search workload, it's perfectly acceptable, especially
given that the throughput—there's really no constraint. You can do much higher
throughput than a traditional OLTP database, right? The other thing is that
these vector indexes are enormous, and that's what's really forcing this change.
If you think about it in terms of data, right, you might have an article that's
about a kilobyte of text, and that kilobyte of text, if you transform that into
a vector, that vector might be in the ballpark of six kilobytes. So your vector
data is already larger than the original data itself. And it gets worse than
that, because what often ends up happening is that you say, "Well, I don't want
to do a vector embedding of the entire article; there's too many things going on
here." To generate a good representation in vector space, you chunk it, right?
So you chunk by paragraph or something like that. Let's say there's a
one-kilobyte article; now it turns into four vectors, each one of those vectors
is six kilobytes. Well, now you have like 24 kilobytes of data, and your space
amplification on the original data is 24. So this is really the forcing
function: there's just an enormous amount of data; the space amplification is
phenomenally high.
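The space amplification Simon describes is easy to verify with napkin math; the chunk count and vector size below are the ballpark figures from the discussion.

```python
# Numbers from the discussion: a ~1 KB article, chunked into 4 pieces,
# each embedded as a ~6 KB vector (e.g. 1536 float32 dimensions ≈ 6 KB).
article_bytes = 1_000
chunks = 4
vector_bytes = 6_000

total_vector_bytes = chunks * vector_bytes          # 24,000 bytes of vectors
amplification = total_vector_bytes / article_bytes  # 24x the original text
print(f"{total_vector_bytes // 1000} KB of vectors, {amplification:.0f}x amplification")
```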
Kaivalya Abte [11:23]:
Yeah.
Simon Eskildsen [11:24]:
That's what warranted me to start looking at a different architecture for this
altogether. On a traditional architecture, either an OLTP or something like
Elasticsearch that works somewhat similarly, when you do a write, that write
goes into a primary and then is often replicated to two secondaries. That means
you're storing every piece of data on disk three times. Generally, you store this on
an SSD or a very high-performance network disk like EBS or whatever. The reason
you do that is because at any point you could fail over to those other
replicas, right? That's part of the point of having them, so they need to have
the same thing. If you want to buy a gigabyte of EBS, like networked storage,
which is the most reliable storage that you buy in the cloud among block storage
options, you have to pay about 12 cents per gigabyte. So if you have three
gigabytes of data because you have to store it on three nodes—the primary and
both secondaries—now you're paying like 36 cents per gigabyte, right?
Kaivalya Abte [12:29]:
Yeah.
Simon Eskildsen [12:30]:
Because it's 12 cents each. And that assumes you're running at 100% utilization,
right? That's every bit filled; there's no excess space. Typically, you're
going to run this at 50% utilization to not get nervous, right? Maybe you do
80% or something like that, but that also requires a lot of ops if you want to
resize that often. A lot of production deployments run with less than that. At
50% utilization, you have to double that number, so now we're talking more like
70 cents per gigabyte. If you're
doing a managed offering, you kind of want to see at least a 50%, preferably an
80% profit margin on that kind of storage. So now, like, the actual cost of
goods sold, the base cost is almost 70 cents per gigabyte. But the cloud
provider has to try and charge you like double that in order for them to offset
all their other spends, right? Often three times. Now with object storage, the
cost just looks completely different, right? The cost of goods sold is that
every gigabyte I put on object storage is already replicated, and it already has
more nines of durability than any other system on earth. It's already
replicated; you benefit from their multi-tenancy, and you're paying two cents
per gigabyte. Then you cache it on an NVMe disk. That's not a network device;
it's an actual SSD attached physically to the box you're on, so you're not
going over the network the way you are with a network disk. It's much faster,
much cheaper, has amazing performance, and now you're talking about 10 cents
per gigabyte. Then you can charge a margin on that and still be cheaper than
hosting Elasticsearch or something like that yourself with a more traditional
deployment model. The economics work out, because the biggest trade-off here,
really the only fundamental trade-off, is write latency, especially if you
wanted to do transactions. For search, this is a really good compromise. This is
not a good compromise if you're writing like a checkout engine or inventory
management or any of the problems that, for example, TigerBeetle is solving.
It's a terrible set of compromises, but for something like search, it's a
phenomenal set of trade-offs.
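Simon's storage cost comparison can be laid out as napkin math. The prices are the ballpark figures quoted in the conversation, not current list prices, and the NVMe cache share is an assumed split of the roughly 10 cents he mentions.

```python
# Classic replicated deployment: networked block storage (EBS-style).
ebs_per_gb = 0.12     # $/GB/month for the networked disk
replicas = 3          # primary + two secondaries
utilization = 0.5     # run half full so you don't get nervous

classic = ebs_per_gb * replicas / utilization  # ≈ $0.72/GB/month before margin

# Object-storage-first: S3 is already replicated; add a local NVMe cache.
s3_per_gb = 0.02          # object storage, replication included
nvme_cache_per_gb = 0.08  # assumed share of the ~10 cents total

object_first = s3_per_gb + nvme_cache_per_gb   # ≈ $0.10/GB/month

print(f"classic: ${classic:.2f}/GB, object-first: ${object_first:.2f}/GB")
```

That 7x gap in cost of goods sold is what lets an object-storage-first system undercut a self-hosted replicated deployment even after margin.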
Kaivalya Abte [14:50]:
Okay, so we talked about trade-offs, right? I want to dive deeper into the topic
at hand, which is how do you think about building a database on top of an object
store? Since you're talking about trade-offs, it's clear it's more challenging
for you to build a high-scale transactional database on top of object storage
than it is to build a vector database or a search database, right? Because of
the expectations or the different trade-offs in terms of write latency and read
latency that you have. So I think the first thing we should talk about is, let's
take a step back, right? You mentioned the motivation of thinking about object
storage, which is probably cost efficiency, which could also mean less
operational overhead because a lot of availability and durability guarantees are
provided to you by the object storage. So how do I approach this problem? When I
think about, "Okay, let's say I want to build a database on top of object
storage," the first thing I have to understand is the trade-off, and that's
clear from how you defined it. What are the next steps? How should I think about
it? If I look at a high level, like an architecture, it is clear that there's
object storage, which is my main storage component, but then I have to also have
some caching layer and then the clients. How do you think about that? Let's
start from there, and we can unwrap each layer at a time and talk more.
Simon Eskildsen [16:14]:
Why don't we just talk it through then? I think this is what you're getting
at—designing the simplest possible database on top of object storage, and then
we can talk about how to optimize every part of it.
Kaivalya Abte [16:25]:
Exactly.
Simon Eskildsen [16:26]:
Turbopuffer aside, though the experience of building it is, of course, very much
at hand. So the first thing you want to do, right, is that you have to accept the
write. There are no reads without doing a write. So that's the first thing you
do. In the simplest possible way, using object storage, what you would do is
create a directory or a prefix called write-ahead log, or WAL for short. Every
time someone does a write, you put a file inside the WAL directory, and you just
increment the number. So the first one is called 1.json, 2.json, 3.json, 4.json,
right? The write-ahead log is where the data gets appended; any new data lives
in there. A delete is just registered as another entry: each record has a type,
and the type might be insert or delete, and that's really all you need. Then
you have the actual record with the data,
which would just be, let's say this is a simple JSON MongoDB type of scenario.
So this is your database, right? Like, really, this is a database in itself. You
can start to satisfy queries on top of this, right? So what we do now is that we
have a node, and there's a binary on that node. This could be written in
whatever language that you choose. Now I do a select all from table. What that
would do is that it would just go into the write-ahead log and stream through
it, respecting the deletes and the inserts. Okay, now we have a database on
object storage. That's all you need. Like, you know, this can do any query. It's
just going to be a very slow full table scan, right? That's a real database. I
think it's going to be very hard to write something that's much more reliable
than that because none of those files are ever going to get lost. You have like
full time travel. Someone could send you a query to say, "Hey, I want to do this
query as per the fourth entry," and then you just do a full table scan up to
that. This is a great database; this database can do a lot of things. It scales
quite terribly, but from here on, like, that's a real database you could ship
that, right?
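The design Simon walks through can be sketched in a few dozen lines. Here a local temp directory stands in for the object-storage bucket, and the record shape (type, id, data) follows his description; a real system would use PUT/GET calls instead of file I/O.

```python
import json
import os
import tempfile

# SimpleDB sketch: a "bucket" (a local directory standing in for object
# storage) with a wal/ prefix of numbered, immutable JSON entries.
bucket = tempfile.mkdtemp()
wal_dir = os.path.join(bucket, "wal")
os.makedirs(wal_dir)

def next_seq():
    """Next WAL sequence number: entries are named 1.json, 2.json, ..."""
    return len(os.listdir(wal_dir)) + 1

def write(record):
    """Append one WAL entry; record['type'] is 'insert' or 'delete'."""
    path = os.path.join(wal_dir, f"{next_seq()}.json")
    with open(path, "w") as f:
        json.dump(record, f)

def scan():
    """Full table scan: replay the WAL, respecting inserts and deletes."""
    state = {}
    for i in range(1, next_seq()):
        with open(os.path.join(wal_dir, f"{i}.json")) as f:
            entry = json.load(f)
        if entry["type"] == "insert":
            state[entry["id"]] = entry["data"]
        elif entry["type"] == "delete":
            state.pop(entry["id"], None)
    return state

write({"type": "insert", "id": 1, "data": {"name": "a"}})
write({"type": "insert", "id": 2, "data": {"name": "b"}})
write({"type": "delete", "id": 1})
print(scan())  # → {2: {'name': 'b'}}
```

Time travel falls out for free: replay `range(1, n)` for any `n` and you get the database as of that WAL entry.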
Kaivalya Abte [18:41]:
Absolutely. But how do you query the data in an efficient manner? So, okay, but
I want to take a step back, right? When we talk about traditional databases, we
say that, let's talk about PostgreSQL, for example, you have your data pages,
and then you also have your WAL, and you write to a WAL to ensure durability,
and then you write to the buffer cache, and then you periodically flush that and
make the entries in the real data pages. I think the reasoning, if I remember
correctly, is that writing to a WAL is much faster than updating the data pages.
But if I think about object storage, think about a WAL versus maintaining it in
a structure which is efficient enough already for reading data. How do you
compare these two approaches? Why do we need a WAL when we are using object
storage?
Simon Eskildsen [19:47]:
Yeah. I mean, I think the best way to answer your question might just be to
continue to evolve on this database, right? We can think of SQL queries that we
want to execute on this database, and then let's see the design evolve, right?
Kaivalya Abte [20:00]:
Yeah.
Simon Eskildsen [20:01]:
Then I think let's backtrack and then do something a bit more akin to what
you're saying, which I think would be a way of approaching it. But it's not as
simple. I like to just build up in complexity and defer it until we need it.
That's how you ship fast, right? So let's say now that one of the things that
you want to start doing is that you want to do an efficient lookup, right? So
you want to do a select all from the table where ID equals something, right?
Kaivalya Abte [20:33]:
Yep.
Simon Eskildsen [20:34]:
The way that we're doing this now is that it's pretty slow, right? Because now
you have thousands of entries in the write-ahead log, and you have to stream
through all of them. We haven't implemented any optimization. So what ends up
happening is that on every query, we go and get thousands of objects, right,
from object storage. This is slow because let's say that this is like gigabytes
of data or megabytes of data. You can saturate the network, sure. Depending on
the box size, we're talking into tens of gigabits per second, so that's going to
be a concern. But we know that the write-ahead log is immutable; 1.json is
never going to change, right? So what we could do is
we could start putting those in a cache. We could put them in a memory cache.
That's the simplest. So now every time we go to S3 or something S3 compatible to
get the WAL, we first look and see in the memory cache whether 1.json exists or
whatever, right? Then we scroll through them in memory. That's probably going to
be pretty fast. I think then, you know, you could do more cache tiering like
what turbopuffer has of also having an NVMe SSD cache and things like that. You
could actually build a database that could do a lot of things just on top of the
write-ahead log and streaming through it. Now, what you're getting at before is
to start building some kind of derived indexes on top, right? Even potentially
one of these indexes could be the primary data storage, right? So that you don't
even write to the write-ahead log; you just update that primary index where the
data is stored as the primary one. I think we can get into that in a second of
why that might be a good idea and why it might be a bad idea, but right?
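Because WAL entries are immutable, caching them needs no invalidation, which is the key property here. Below is a sketch with a plain dict as the in-memory tier; `fetch_from_object_storage` is a hypothetical stand-in for a real GET call, and a real system would add an NVMe tier and eviction.

```python
# Immutable WAL entries can be cached forever: 1.json never changes,
# so a cache hit is always correct.
cache = {}
gets_to_object_storage = 0  # count how often we actually hit the store

def fetch_from_object_storage(key):
    """Stand-in for s3.get_object(Bucket=..., Key=key)."""
    global gets_to_object_storage
    gets_to_object_storage += 1
    return {"type": "insert", "id": int(key.split(".")[0]), "data": {}}

def get_wal_entry(key):
    if key not in cache:              # miss: one GET, then cached forever
        cache[key] = fetch_from_object_storage(key)
    return cache[key]                 # hit: served from memory

get_wal_entry("1.json")
get_wal_entry("1.json")               # second call never leaves the process
print(gets_to_object_storage)         # → 1
```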
Kaivalya Abte [22:21]:
Yeah.
Simon Eskildsen [22:22]:
Now we have this point lookup; it's actually pretty good now. It checks the
in-memory cache first; if it misses that, it goes to the NVMe SSD cache. It's
still scrolling through everything until it finds the ID, but it has to go all
the way to the end because you could have upserted it multiple times. So that
the stream through everything, we could stream in reverse so that could be an
optimization we can do, and then we might find the ID earlier. But we might do
things like this, but of course, if you want to make this fast, you've got to
start creating some derived indexes off of the WAL data. We might ask, "Okay,
are we only going to do point lookups, or do we want range scans and things
like that?" If you're only doing point lookups, you can do a hash map. If you
want to have range scans and some ordering and things like that, you might do a
B-tree. So we could create that, right? So what we do is that we could just
build it in memory. So we have the entire write-ahead log; every time you ask a
question, we have an in-memory B-tree map, and we just consult that for any
updates, right? Every time a query comes in, we just look in there and see if
it's there. If it's in there, we just serve it out of there. Now you might start
to think about consistency, right? Because now another process might be writing
and have added, you know, 3.json, and the B-tree map was built at 2.json. Do you
have to now—you have a decision point. Do I want eventual consistency? Do I want
strong consistency where you can read your write immediately, right? Like I
insert and then I see it immediately. So you have a decision point. You could
decide, "No, I want to be consistent; that's how I want to do it." So what
you're going to do is that every time a query comes
in, you do a list call to S3 in the WAL directory before you do everything else.
Then, of course, if all those files are in cache, you're good to go, right? You
know that you've updated the index up to that WAL pointer, and you can just
serve that out of the cache. You have to do the list call. Those list calls
might take a couple of tens of milliseconds, but you have a consistent database,
and it's really simple to run. If your query node goes down, you're good. Now
you might be in a scenario where you're starting to get a lot of queries, right?
The little box you have this running on can't do any more. You've been
optimizing all of this; your B-tree map is pretty good; you're doing all this
async indexing, and you're doing all this read coalescing, and you've done all
the optimization you can figure out with performance. So now you have multiple
nodes. Now you have writes to the WAL from multiple nodes, right? Maybe a query
can come into any of them. So what you do now is you do a list call before you
do the write and like, "Okay, it's currently at four; I'm going to add five."
Then you use a compare-and-swap kind of mechanism to make sure that you can only
put it if it doesn't exist, and you can start to add layers of complexity like
that.
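The strongly consistent read path Simon describes, list the WAL prefix, apply anything the derived index hasn't seen yet, then serve, can be sketched like this. The in-memory `wal` list stands in for the object store, and `list_wal` and `load_entry` are hypothetical stand-ins for LIST and GET calls.

```python
wal = []          # pretend object store: list of (seq, entry) pairs
index = {}        # derived in-memory index (id -> data)
applied_upto = 0  # WAL pointer the index currently reflects

def list_wal():
    """Stand-in for a LIST call on the wal/ prefix."""
    return [seq for seq, _ in wal]

def load_entry(seq):
    """Stand-in for a GET of wal/{seq}.json (cacheable: entries are immutable)."""
    return dict(wal)[seq]

def query(record_id):
    """Read-your-writes: catch the index up to the WAL before serving."""
    global applied_upto
    for seq in list_wal():            # the consistency step before every query
        if seq > applied_upto:
            entry = load_entry(seq)
            if entry["type"] == "insert":
                index[entry["id"]] = entry["data"]
            else:
                index.pop(entry["id"], None)
            applied_upto = seq
    return index.get(record_id)

wal.append((1, {"type": "insert", "id": 7, "data": "hello"}))
print(query(7))  # → hello: the new entry is visible immediately
```

The LIST call costs tens of milliseconds per query, which is the price of strong consistency in this design; relaxing to eventual consistency lets you skip it and catch up asynchronously.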
Kaivalya Abte [25:08]:
Interesting. Yeah, so just to recap a little bit what we have by now, right? We
have written to a WAL that's there in the object storage. We have also built
some sort of indexes like for key lookups, key-value lookups; we have a hash
map. For range queries, we can have a B-tree. For consistent writes, you can
also keep these B-tree indexes in sync by making a list call to object storage
and so on. Let's take a step back. While I'm writing this to the WAL, what are
the things that could be challenging because it's on an object storage? For
example, if I have two writers trying to write a file, and the other one writes
2.json already, and the other one is trying to process, and then as soon as it
has enough data, it also writes to 2.json. How do you handle this kind of
conflict between two writers? Let's start from there and identify some other
challenges we might be facing.
Simon Eskildsen [26:05]:
Yeah, I think there's probably five different things that you could do at this
point. So we can go through them, right? It's another decision point. We made a
decision earlier that we're going to be strongly consistent. Now we're at
another decision point of what do we want to do if we have multiple writers? So
one way that you could solve this is by deciding that you never have multiple
writers; you only ever have a single writer. This rule will have problems
because you kind of have to enforce that somehow.
Kaivalya Abte [26:32]:
Yeah.
Simon Eskildsen [26:32]:
How do you want to enforce that? That's another decision point, right? So we're
not going to walk down that decision tree. But that could be one way. Another
thing that you can do is that you could bring in some kind of lock service,
right? Because you could sort of say, "Okay, while I'm writing to the WAL, I
have to have a lock." So I go to ZooKeeper or I go to etcd or I go to some kind
of consensus layer and ask the consensus layer for permission to write to the
WAL, write to the WAL, release the lock. Now, if you have multiple writers,
that's going to work just fine, right? One comes into node A, writes, and then
node B is waiting until that write, and there's some fairness there. The third
option, I guess, we're at now that you could do is that when you do the write,
the first thing you do is you do a list. The list might come back and say the
WAL has 1, 2, and 3, and then you should be writing 4. So then you write in 4,
but you put a condition on the write operation. All the major object store
providers support this, where you can say that if 4 exists when I write it,
reject the write. It only allows you to add it, right? Then it comes back, and
you can decide what you want to do. Do you want to fail, or do you want to just
try again with a bigger number, right? Of course, this changes kind of the
semantics around what database guarantees you get, because you now can't
guarantee that the order 1, 2 in the WAL is also the order the writes arrived
in. That may
or may not matter to you. The fourth option that you could do, actually, let's
keep it there because I think the other options I can think of are derivatives
of either not caring with a single writer or doing something consensus-based or
then using object storage itself. Every other variant I can think of off the top
of my head are variants of those.
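The third option, conditional writes, can be sketched as a compare-and-swap loop: LIST to find the next sequence number, then PUT with a precondition that the key must not already exist. On real S3 that precondition is `put_object(..., IfNoneMatch="*")`; here an in-memory dict simulates the store so the retry logic is visible.

```python
import json

store = {}  # simulated object store: key -> body

def put_if_absent(key, value):
    """Conditional PUT: succeeds only if the key doesn't exist yet."""
    if key in store:
        return False          # another writer won the race
    store[key] = value
    return True

def append_wal(record, max_retries=10):
    """Append a record at the next free WAL position, retrying on conflict."""
    for _ in range(max_retries):
        seq = len(store) + 1              # "LIST" says the WAL has seq-1 entries
        key = f"wal/{seq}.json"
        if put_if_absent(key, json.dumps(record)):
            return seq                    # committed at this position
        # conflict: someone else took this slot; retry with a bigger number
    raise RuntimeError("too much contention on the WAL")

print(append_wal({"type": "insert", "id": 1}))  # → 1
print(append_wal({"type": "insert", "id": 2}))  # → 2
```

No lock service is needed: the object store itself arbitrates which writer gets each position, at the cost of the ordering caveat Simon mentions.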
Kaivalya Abte [28:29]:
I think conditional writes are now supported, so that is a good way to handle
this conflict. You mentioned something about the ordering of the writes, and in
some cases, that could be extremely important. The traditional way of handling
that is to have a single writer so it can order based on some timestamp or
properties. But let's say we have multiple writers, and then, of course, there
can be different failures. One write can be slow; the other writer can be fast
and so on. In that case, how would you think about ordering guarantees if that
was a constraint?
Simon Eskildsen [29:04]:
Yeah. Oh, there is a fourth option here, which is relevant to what you're
discussing that's very simple, which would be to use some UUID v7 or something
like that, right? Like a semi-ordered UUID where the first 48 bits are a
timestamp and the rest is randomized.
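This fourth option can be sketched as a minimal UUIDv7-style generator: a 48-bit millisecond timestamp followed by random bits, so keys sort roughly by wall-clock time while concurrent writers can't collide. Real UUIDv7 also sets version and variant bits, which are omitted here for brevity.

```python
import os
import time

def uuid7ish():
    """48-bit millisecond Unix timestamp followed by 80 random bits."""
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")
    return (ms << 80) | rand

# Keys generated later sort after earlier ones (to millisecond
# precision), so WAL entries named this way stay roughly time-ordered
# even with multiple writers, and collisions are vanishingly unlikely.
a = uuid7ish()
time.sleep(0.005)
b = uuid7ish()
print(a < b)  # → True
```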
Kaivalya Abte [29:20]:
Yeah.
Simon Eskildsen [29:21]:
You could do that, which has similar properties. What kind of guarantees do you
want around writes? I mean, there's a layer here, and there's extremely
technical vocabulary of how to use this, and I get it mixed up all the time, and
I think it would be very hard to parse in audio format. So we're not going to go
too deep into it, but rather just talk about this at a high level. The first
thing, right, is just how do you resolve the multiple writers, right? Like
there's two writes coming in at the same time. Do you guarantee that if this
one, this request was accepted first, that the write then comes before the other
one, right? You might be writing the same record. I think for some workloads,
this may be very important, right? There are different isolation and guarantees
that you can have around this transactional nature of it. I think that for
something like search or something else, it becomes a little bit less important.
I think that generally what you would try to do in systems like this is to do as
good of a job as you possibly can to make sure that the writes come in in the
order that they arrived, typically through a single writer. The multi-writer
scenario happens more if there's some kind of outage, right? If you want to have
very strict guarantees around the ordering and how things come in, you can do
that even with these models that I propose. But I think it might be getting into
the weeds of the transactional guarantees, the isolation levels, and all of that
in a way that may not be super fruitful. I think it's maybe more interesting to
talk about that, of course, when you return a successful response back to the
client, we know that it's been committed to object storage, which is like a very
strong guarantee. Another type of guarantee that's very easy to reason about is
that everything that's been committed together in our WAL database is going to
be applied at the same time. So you might have a WAL entry that updates 1, 2,
and 3, and all of those things will always be seen at the same time, right?
Kaivalya Abte [31:24]:
Yeah, that's right. As you mentioned, there are different trade-offs for each of
these approaches, and it all depends on the guarantees you need. We don't want
to go into the realms of transactional databases where there's really strong
guarantees, but still be in the search space. Probably taking an example would
be helpful here, for example, from Shopify or something. There are different
aspects of writes as well. For example, I want to talk a little bit about how
frequently do you write data to object storage because there is also cost
attached to that part, right? How do you approach that kind of trade-off between
the number of writes to the object store versus caching or batching them
together?
Simon Eskildsen [32:06]:
For sure. Yeah, I mean, let's phrase it in terms of our very simple database,
right? We'll call it SimpleDB. In SimpleDB that we're talking about here, you
might have a scenario where, yeah, someone is clocking in a couple of writes per
second. Of course, you could do 1.json, 2.json, and so on for all those writes.
But if they don't care a ton about write latency, what you might decide is to
coalesce those writes. In a traditional database, this is called a group commit.
If you think about, for example, something like MySQL, if you're doing thousands
of writes per second, you may have performance issues if you try to sync to disk
for every single write. I think I wrote an article once called, "Is the number
of transactions per second equal to the number of fsyncs per second?" That was
the simple hypothesis going on, right? If an fsync takes one millisecond, which
it often does on a modern SSD. By the way, I collect all these numbers in a
GitHub repository called sirupsen/napkin-math; one of those numbers is the cost
of committing durably to an SSD. So, okay, if
that takes one millisecond, that means that you can do a thousand transactions
per second, right? Well, no, because MySQL is smarter than that. If you're
updating one byte there, two bytes there, and so on, all these different
records, we can coalesce that into however much data the disk can durably commit
at once, right? How much an fsync can commit depends on the disk and the
operating system, more the disk than the operating system, right? But let's say that it can
commit four kilobytes at once. Well, if every single write comes down to 100
bytes, well, now you can commit 40 things at once rather than one, right? So now
instead of committing a thousand times, like one over one millisecond, which is
a thousand fsyncs per second, you can do a thousand times 40, so that's 40,000
transactions or updates per second, right? So that's one thing. We can use that
same concept with object storage, but the numbers look very different, right? A
commit takes hundreds of milliseconds depending on the size of the commit. So we
might decide that every second we only flush a handful of batches to object
storage—whatever lines up with those hundreds-of-milliseconds commits. It comes
down to cost, right? So it's really more of a latency requirement of like,
"Okay, 250 milliseconds or so for writes is acceptable to the users." We're
going to create windows of that, and then you can imagine kind of like, "Okay,
you're in the first 250 milliseconds of a second; we coalesce all those writes."
There might be like hundreds of writes coming in, and you just literally sleep
inside of the node that's accepting the writes, and then you coalesce them
together, commit them, and then you send it back to all the clients, right?
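The group-commit idea Simon describes, buffering writes inside a fixed window and acknowledging them all after one durable commit, can be sketched roughly like this. This is a toy, not turbopuffer's implementation; the class and callback names are made up, and `flush_batch` stands in for one PUT of the whole batch to object storage.

```python
import threading

class WriteCoalescer:
    """Toy group commit: buffer writes for a fixed window, then flush
    them as one batch (e.g. one PUT to object storage) and ack all
    waiting clients together. Illustrative sketch only."""

    def __init__(self, flush_batch, window_s=0.25):
        self.flush_batch = flush_batch  # one commit for the whole batch
        self.window_s = window_s
        self.lock = threading.Lock()
        self.batch = []
        self.batch_done = threading.Event()

    def write(self, record):
        with self.lock:
            if not self.batch:
                # First write of this window: schedule the flush.
                done = self.batch_done = threading.Event()
                threading.Timer(self.window_s, self._flush).start()
            else:
                done = self.batch_done
            self.batch.append(record)
        done.wait()  # client blocks until the whole batch is durable

    def _flush(self):
        with self.lock:
            batch, self.batch = self.batch, []
            done = self.batch_done
        self.flush_batch(batch)  # e.g. one PUT of many coalesced writes
        done.set()               # wake every waiting client at once
```

Hundreds of concurrent writers landing in the same 250 ms window all share a single commit, which is exactly the fsync trade MySQL makes, transplanted to object storage.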
Kaivalya Abte [35:06]:
Yep. Now I think that's an interesting point. I guess there was also a
discussion when we talked about SlateDB with Chris the last time. There was one
interesting point where SlateDB does it based on the time, let's say some
linger-ms configuration, which has, let's say, 100 milliseconds or something,
and then for 100 milliseconds, it will just try and batch and collect it in
memory and then flush it to object storage. But then there was another question
that I asked; it was about the amount of data. What are your thoughts on that?
Waiting for a certain duration, like you mentioned, like sleep for a certain
duration, and that sleep is defined by the time, let's say the wall clock. How
do you compare that with the amount of data, and what kind of workloads would
make sense for the other approach?
Simon Eskildsen [35:51]:
Yeah, that's an excellent question. The way we think about it, the model that
I'm presenting here, by the way, is a heavily simplified version of what
turbopuffer actually does underneath the hood, right? We're just starting from
first principles here and building up, which is my sport.
Kaivalya Abte [36:06]:
Yeah.
Simon Eskildsen [36:07]:
And so time-based versus size-based, it's really a cost question, right? The
fastest thing that you can do is that every single time a write comes in, you
just commit it to object storage, right? Your conflict resolution is not this
incremental simple thing we just talked about; it's something different we could
talk about what that could be. But that would be the absolute fastest way to do
it. The whole reason to coalesce the writes in the first place is to reduce what
we call the put amplification, right? When you're talking about the database,
there are different types of amplification that matter. There's the get
amplification, which is how many S3 GET calls you're doing. A million gets cost
about 40 cents. There's the put amplification, which is the number of operations
where you're writing or doing something write-related, like listing, which is
also expensive, or multi-part start, multi-part commit, and a multi-part part
number upload. Then there's the storage amplification, right? Which is how many
times are you storing an individual byte? There is the compute amplification,
which matters more for building indexes. There is the read amplification;
there's the write amplification. That's been very well studied, right? Any LSM
paper out there is going to just talk about storage amplification, read
amplification, write amplification, the write workloads. I'm sure Pavlo, Andy
Pavlo, has written a beautiful paper putting in a diagram like which compaction
algorithms optimize for what, right? But with object storage-based LSMs and
databases, we have to walk another two dimensions of compromises, which are the
get and the put amplification. When we write to the WAL, that's really what
we're balancing here, right? The lowest-latency thing to do is just to
commit every single write to object storage. If we coalesce them, we reduce the
upper bound of the put amplification to four puts per second with the scheme
that I just mentioned. Or maybe we're okay with even lower put
amplification but even higher write latency, which is what we're directly
trading off here, saying that we're going to commit once a second now. Then
there's a tunable here, right? Which is just tunable between these two extremes
of, "Okay, like my 250-millisecond window has grown to a gigabyte; now I'm going
to upload that." You can saturate the network doing that. If you can assume that
you saturate the network, then it's not really a huge issue. But if you drill
down even more, in order to saturate the network with a single gigabyte chunk,
you have to use multipart uploads with carefully tuned part sizes. So you may
instead decide that you're going to split this into individual chunks and do
individual puts instead of multipart. But now you have individual files, which
increases the read amplification and the get amplification down the line. I'm
not saying this to intimidate you with the complexity, but more so that this is
a set of compromises. I think the most important compromise for an early
database is simplicity. What is simplest to implement? Right? Then you go with
that. Once you start finding from real workloads what the problems are, or maybe
you know so much about the workloads that you can infer it, then you can graph
all these things and then choose the optimum point for flushing at whatever
interval. But a blanket answer doesn't really exist, right? In turbopuffer,
we've settled on a couple of standards that we've, you know, binary search and
bisected our way to by customer requests and seeing the different trade-offs
there and optimizing our cost of goods sold, right?
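The put-amplification arithmetic behind this trade-off is easy to make concrete. The numbers below are illustrative, roughly S3 list prices (about $0.40 per million GETs, as mentioned above, and about $5 per million PUTs); the function name is made up.

```python
# Back-of-envelope cost of WAL flushes to object storage at different
# coalescing windows. Prices are illustrative, S3-like list prices.
PUT_COST = 5.00 / 1_000_000   # dollars per PUT (assumed)
GET_COST = 0.40 / 1_000_000   # dollars per GET (assumed)

def monthly_put_cost(flush_interval_s):
    """Cost of one writer flushing its WAL every flush_interval_s."""
    seconds_per_month = 30 * 24 * 3600
    puts = seconds_per_month / flush_interval_s
    return puts * PUT_COST

# One PUT per write at 1,000 writes/s, versus 250 ms windows (4 PUTs/s):
per_write = 1000 * 30 * 24 * 3600 * PUT_COST  # about $12,960 per month
windowed  = monthly_put_cost(0.25)            # about $52 per month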
Kaivalya Abte [39:45]:
Yeah, I think SlateDB also uses the simple approach, which is just wait for some
time and then try to upload. Probably when we build this database, we have more
use cases; we can also give this choice to the user that, you know, you decide
because the cost is directly impacting how you are charged. So it could either
be a hybrid approach or just, you know, one of these two. And that makes sense;
it's all about trade-offs. Talking about writes even further, I want to take a
step back and, you know, talk about how these writers are even coming and
becoming part of our, you know, let's say the cluster. Since we were talking
about conflict resolution, that one writer is trying to write this file, and
let's say 2.json, and the other writer is also in progress. Then you have to use
the list operation, which is even costlier. Can we not do something like when
the writer comes up, they register themselves in object storage and probably get
a writer ID or something? Instead of using 2.json, we could just suffix it with
the writer ID as well, so we never have this kind of conflict. What are your
thoughts on that? Have you seen this approach? Have you evaluated something like
that?
Simon Eskildsen [40:57]:
Yeah, so you want to put the ID in there, and then you can accept the write from
any node. Yeah, 100%. I mean, to me, this is just a variant of doing the UUID
v7, right? It gets a little bit more difficult to reason about what the ordering
of operations is, but you can definitely do that, right? I think the isolation
guarantees you get are not as high as with the other model. But again, this is
like a different set of trade-offs. I think that would absolutely work. I think
if you're doing that, you might as well just do a UUID v7 or something even
simpler so you don't have to keep track of or like, yeah, it's not a meaningful
number, right?
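The appeal of UUIDv7-style keys here is that they are time-ordered, so WAL objects written by any node still sort roughly by arrival time under a shared prefix. A minimal hand-rolled key in that spirit (not the exact UUIDv7 bit layout; the function name and prefix are made up):

```python
import os, time

def wal_key(prefix="wal/"):
    """UUIDv7-flavoured object key: 48-bit millisecond timestamp plus
    a random tail. Keys from any writer sort lexicographically by
    wall-clock time, so listing the prefix yields an approximate
    arrival order with no writer registration needed. Sketch only."""
    ts_ms = int(time.time() * 1000)
    return f"{prefix}{ts_ms:012x}-{os.urandom(8).hex()}"
```

The fixed-width hex timestamp is what makes plain lexicographic ordering line up with time; the random tail prevents two writers colliding in the same millisecond.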
Kaivalya Abte [41:37]:
Absolutely. I guess, I mean, I think we should talk a little bit more on, you
know, the read path and how do we make it performant and the caching and
tunable, you know, caching and all of that. Probably then we can talk about, you
know, what happens when the writers go down. So doing a little recap, we have
our WAL in our object storage; we have also built some sort of indexes like hash
indexes, binary trees, and, you know, in memory. We have some tunable properties
on how we, you know, flush data to object storage. We have some, you know,
trade-offs there to understand depending on the workload. Now, let's say our
range or hash-based queries are fast enough. What's next? How do we make the
reads even better and more cost-efficient? Because I guess it all depends on the
amount of size that it could cache on each node, right?
Simon Eskildsen [42:25]:
Yeah, I think, I mean, the trouble with this architecture, right, is that we
talked at the top about the cost of doing various things, right? The memory that
you get in the cloud usually costs you about $2 per gigabyte. The architecture
we're talking about here works great, right? But it's assuming that we
essentially just hold all the data in memory all the time. So we're just using
object storage for persistent storage, and then really, you know, if you think
about it, a database is just kind of like a cache or derived indexes over the
write-ahead log, right? In our case, we've just chosen to put everything in
memory, and that's fine. But we have a couple of problems here on the read path.
The first problem is that a cold query is going to be incredibly slow, right? It
has to saturate the network reading the write-ahead log, which could be millions
of entries. You have to download everything; then you have to replay the entire
log. Especially if it's JSON, you're going to be spending a lot of time
deserializing and doing a lot of memory copies to get to the end to create these
derived indexes. You might think, "Well, how often does that happen?" Well,
every time the node goes down or you have to upgrade a version or you deploy,
this is going to happen unless you do something even more complicated to keep
track of that memory. So this is our biggest problem right now: cold queries are
slow, and the query nodes are very fragile. As soon as a query node goes down,
you're going to just be down until the entire write-ahead log has been replayed.
Kaivalya Abte [44:02]:
Yeah.
Simon Eskildsen [44:03]:
What we have to do now is we have to take these derived indexes and start
persisting them to object storage. So what we're going to do is that when we
build that B-tree as of WAL entry 4.json or wherever we are, we also take it
and put it into object storage. So we now have B-tree.json. Let's say we're
serializing it to JSON, and we put that there. At this point, you might be using
some kind of binary serialization, right, of whatever programming language
you're using to just serialize the struct. But let's just say that it's JSON for
simplicity. Inside of B-tree.json, you're also going to have a pointer to say
this is an up-to-date B-tree as of WAL entry 4.json. So now when you have a cold
query, what we can do is we can just download the B-tree map from object storage
and put that into memory, and then replay the log from 4.json onwards so we're
still strongly consistent, right? At some point, we'll flush a new B-tree
map up to object storage. This is better, right? Now on a cold query, we have to
download just this derived data. But you're also going to run into trouble with
that eventually because that is an enormous B-tree map. If you have a database
with millions or tens or hundreds of millions of entries, you have to download
the entire B-tree map and then start searching.
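The checkpoint-plus-replay idea can be sketched in a few lines: persist the derived index together with the WAL position it reflects, and on cold start load the checkpoint and re-apply only the tail of the log. All names here are made up; a dict and a list stand in for object storage and the WAL.

```python
import json

def save_checkpoint(store, index, applied_upto):
    """Persist the derived index plus the last WAL entry it covers."""
    store["btree.json"] = json.dumps(
        {"applied_upto": applied_upto, "index": index})

def cold_start(store, wal):
    """Load the checkpoint, then replay only WAL entries after it,
    so the node is strongly consistent without a full-log replay."""
    ckpt = json.loads(store["btree.json"])
    index = ckpt["index"]
    for seq in range(ckpt["applied_upto"] + 1, len(wal)):
        index.update(wal[seq])  # re-apply just the tail of the log
    return index
```

The pointer inside the checkpoint is what keeps this safe: a stale checkpoint is never wrong, it just means replaying a slightly longer tail.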
Kaivalya Abte [45:25]:
Yeah.
Simon Eskildsen [45:25]:
Now you're starting to get into the more complicated realm of, "Well, I actually
want to walk the B-tree map on object storage, right? So I want to go there, and
then I want to get the root node of the B-tree with one round trip, right? With
a range read because I know where in the B-tree the root node is. Then I get the
children from the root node, and then I do the actual tree search directly by
negotiating with object storage." The problem with that is that the object
storage P90 is somewhere around 250 milliseconds. It depends a little bit on the
size of how much data you're requesting, but generally, you can get it to this
if you're willing to get a bit of complexity and some hedge requests and stuff.
This is problematic because, you know, you have to search log n layers of this
B-tree. So for a small one, right, what is it for a million entries? Log base 2
of a million is about 20, so 20 round trips. That's 20 times 250 milliseconds,
or let's say the P50 is 50 milliseconds, so 20 times 50. Now we can do this in about a second
to walk the B-tree. It's still better than downloading the entire thing and
putting it into memory. You try to do a lot in one round trip and minimize the
number of round trips. So with our B-tree now, we say, "Okay, we're going to
have more children per node," and this is essentially what a B+ tree and some of
these optimized derivatives do: they make each layer fatter. So we're going to
do that. We might also try to cache some of
the upper layers if we have a lot of data, but now we're down to maybe doing
three or four round trips, right? Then we can get a lot better performance into
hundreds of milliseconds even on a cold query. Now, of course, after the cold
query, it's in memory, and then we only have to walk the WAL and do this
consistent pointer thing, but we're in a pretty good state now.
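The fanout-versus-round-trips arithmetic above is worth writing down. A small helper (names and the 50 ms P50 latency are assumptions for illustration):

```python
import math

def cold_lookup_ms(entries, fanout, rtt_ms=50):
    """Levels to walk a tree of `entries` keys at the given fanout,
    times an assumed per-request latency to object storage."""
    levels = max(1, math.ceil(math.log(entries, fanout)))
    return levels * rtt_ms

binary = cold_lookup_ms(1_000_000, 2)    # ~20 levels -> ~1000 ms
fat    = cold_lookup_ms(1_000_000, 128)  # ~3 levels  -> ~150 ms
```

Fattening each node from 2 children to 128 collapses a million-entry tree from about 20 round trips to 3, which is the whole reason B+ tree-style layouts win on object storage.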
Kaivalya Abte [47:23]:
So instead of rebuilding the entire B-tree, you're persisting that in object
storage, and whenever something goes wrong or you need to scale up, you can just
read the B-tree and start from there. So maybe the writes are not that
high, so you don't have to walk over the entire log but a part of it, and that's
where we are saving. But then all the nodes need to do this because who exactly
is building the B-tree in this architecture? That's still unclear if I
understand correctly. We have some nodes that are writer nodes that write the
data, and then whenever there is a query, some node is building the B-tree, or
who is responsible for building this B-tree?
Simon Eskildsen [48:07]:
Yeah.
Kaivalya Abte [48:07]:
Is there something working on the object storage?
Simon Eskildsen [48:10]:
We haven't dealt with any problems yet that have made us deviate from just
having one binary on one node. So it's really one node doing all of these
things. We can split it, but we haven't from first principles yet encountered
any issue that's causing us to want to do that. So yeah, we could split it.
There are reasons to split ingestion, split reads, split all of these different
things, but we haven't encountered anything yet that requires us to do that,
right? If we want to optimize for reliability, well then, yeah, we need to have
multiple nodes, and we need to do—we need to make sure that we route the table
to where it's most likely to be in cache. But those are not really optimizations
we've done yet. So for now, let's just assume that it's the same node doing
everything. But there are problems with this if you're adding things to the
B-tree like a thousand times a second, like we were talking about before. Well,
are you also writing the B-tree to object storage
every thousand times a second? Every single time there's a new entry? Probably
not. You probably put some heuristic to say, "I'm only going to write every time
the WAL has grown by like n elements," because otherwise, you have too much put
amplification and too much write amplification, because every time you write one
byte, you are rewriting the entire database to object storage.
Now, this is where you start to get into building an LSM, right? Because if this
is a problematic trade-off for you, you can't just keep writing the entire
B-tree into one object on object storage. You have to start splitting it. And
that's why you build an LSM, right? At the end of the day, the idea behind the
LSM is just instead of writing the B-tree map, streaming out to one file and
keeping everything there, we flush batches of updates and then merge them
together at read time. So I might say, "Okay, the WAL has grown by 1,024
elements; I'm going to build a new B-tree for just that,
and I'm going to put that into object storage." When I do a query, I have to
read both of those files, right? So that's how we start to see the pattern. This
is not really an LSM yet; it needs some more elements, but it's that idea. And
then just all the other compromises and problems we encounter later will turn
that into a proper LSM. Because, of course, if you're doing thousands of writes
per second, that means we're doing one more file a second. Every time we do a
read, a cold read, now we have to read potentially thousands of B-trees and do
the queries on those. So we have to merge them periodically, and that's
compaction in the context of fixing a fundamental flaw.
Kaivalya Abte [50:50]:
Okay, interesting. Yeah, so you mentioned reliability. We haven't really faced
any problems when you just have one node doing everything—handling reads,
handling writes, and probably also doing some compaction. Let's dive a little
deeper into what advantages we get by separating out readers, writers, and
compactors because some databases have this separation. Is it more control? Is
it just differentiating the workloads? So probably writes are more important
than reads, and we can scale writes differently. What are the different
trade-offs here when we talk about separating these components into different
nodes versus just having one node doing everything?
Simon Eskildsen [51:33]:
Yeah. So let's walk through every workload first. So for the compaction, right,
which is a fairly resource-intensive activity: perhaps you're compacting the
WAL; perhaps you're compacting these B-trees, or it could be an approximate
nearest neighbor index, or other types of derived indexes. In the case of an
LSM, it's typically an SSTable, but it doesn't matter too much; it's basically
just a sorted list of things. The compactor node is nice to separate out
because these LSM-based approaches fit object storage so naturally, given the
append-only nature of the data. We like to separate that out so that the
compactors can once in a while download the data, compact it, and then change
the pointers so that we use the new data. So that's a really good workload to
separate out. In the context of
something like turbopuffer, we call this indexers. Yes, they compact, but the
primary thing that they're doing is building these approximate nearest neighbor
indexes, which makes compaction look like a very easy job in terms of resource
consumption. So the compactor might be doing both: it's compacting data, and
it's also indexing it. Indexing into a B-tree is so little computation that you
might as well do it on the query node because it's just an n log n operation.
But for an approximate nearest neighbor index, the computation is much more
expensive. It's the same with an inverted index for full-text search.
Kaivalya Abte [53:07]:
Mm-hmm.
Simon Eskildsen [53:07]:
You can choose that depending on those compromises. Then you have the write
workload. At the end of the day, the write workload, if you take the compaction
out of it, is really just a matter of the deserialization of the data coming in
from the user, serializing it into a WAL entry, and then adding it to the WAL.
It really is just a type of data proxying and looks a little bit more like maybe
a streaming, very, very simple streaming workflow than anything else. Of course,
in some databases, you can even go so far as making that like essentially zero
copy by having the client write in the same format as you persist into the
database after some consistency checks. So writes are computationally very easy;
they can be network intensive. So you may want to separate them out at some
point if you start saturating the network a lot on writes, which may impact your
reads, or prioritize reads over writes in a networking abstraction that you
build yourself, or rely on the kernel, whatever you want to do. We don't
separate writes and queries; it would be very easy for us to do it later, which
is part of why we don't do it now. Another reason to not separate writes is that
when you write, say, 4.json to the WAL, you could also write it to the local
cache. If
you do that, then it's already in the cache, and you don't have to go to object
storage, which means that a read that's coming in immediately after is already
hot. So that would be another reason to not separate it, and that's a very cheap
operation to do when you already have the data. You don't have to—there's no get
amplification. Then there's the query itself, right, which we talked about.
There may be advantages or disadvantages to separating it from the write path.
This is the highest-priority task because it's very latency-sensitive in
this type of system. In a traditional OLTP database, the database doesn't really
have any sense of what's more important to write or read in terms of latency.
But as a nature of our system, we know that write latency can't be the most
paramount property of the database because then you wouldn't have chosen this
type of database in the first place. So the queries you would often run on the
same nodes, right? A third or fourth type of workload, other than compaction,
writing, and reading, would be some periodic activities,
right? Typically, this would be done on the compactor. This might be things like
billing or like GCing objects and things like that. But that could be another
workload. Typically, you do it on the compactors, right?
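The write-through-cache argument for keeping writes and queries on the same node can be pictured like this. A sketch with made-up names; `object_store` is any dict-like blob store standing in for S3.

```python
class WriteThroughNode:
    """Toy combined write/query node: every WAL object committed to
    object storage is also dropped into the local cache, so a read
    arriving right after the write is already hot. Sketch only."""

    def __init__(self, object_store):
        self.object_store = object_store
        self.cache = {}  # stand-in for a local memory/NVMe cache
        self.gets = 0    # count of paid trips to object storage

    def commit_wal(self, key, blob):
        self.object_store[key] = blob  # durable commit (the PUT)
        self.cache[key] = blob         # write-through: free cache fill

    def read(self, key):
        if key not in self.cache:      # miss -> one paid GET
            self.gets += 1
            self.cache[key] = self.object_store[key]
        return self.cache[key]
```

Because the node already holds the bytes at commit time, the cache fill costs nothing and adds zero get amplification, which is the advantage a split write tier would give up.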
Kaivalya Abte [55:33]:
Yeah, I think that makes sense. I guess we have to look at what kind of
computation is required for writing, what kind of computation is required for
reading, and compacting, right? That's when you'll be able to make a better
decision and provide resources to these nodes. If it's, as you mentioned, data
proxying, you probably don't need that compute-intensive resource; it's more
like a network-intensive operation. But for compaction, it might also be network
and compute-intensive; that depends. And for readers, I guess it's more like a
memory-intensive operation because it has to cache data. You mentioned making
reads and writes on the same node because then you already have this cached data
because probably it's just written. But imagining multiple nodes doing this, are
we talking about another node that keeps the mapping of who wrote this data or
where are we putting this mapping where it knows this is the node that just
wrote this data, and probably it should also be served from this?
Simon Eskildsen [56:36]:
Yeah, I think the simplest solution to this that I can think of is to do
consistent hashing so that when you come into the load balancer, it looks at the
key that you consistently hash on. It might be something like the tenant ID and
then the table name, right? You hash that to the node. So let's say there's
three nodes; we always go to node A for Kaivalya's data, and we always go to
node C for Simon's data. So that would be how we do it. Of course, at some
point, you're going to run into problems with simple consistent hashing because
maybe your table is 10 billion elements and mine is 10, so again, if we look at
it from the same lens as we did when we thought about how to separate
compaction, GCing, writing, and reading, there's kind of a bar chart of the
resource constraints that something utilizes: CPU, memory, NVMe, and network,
right? Inside of those, you also have things like all the implications that we
talked about before—IO, storage, whatever, right? So at some point, the load
balancer may need to know properties about the system to know whether it should
spread across multiple nodes, whether you should have your own node. There are
various levels of sophistication you can do there. Multi-tenancy, in general, is
a very difficult problem, right? That's what I spend a lot of my time at Shopify
doing, just protecting tenants from each other and making sure that they get a
sizable slice of the pie. You're not going to be happy if some major YouTuber
dropping a new product means that your store is down. With turbopuffer, you're
probably going to be even more mad if, you know, Cursor, one of the big
workloads, causes any problems for you; that's not a good database experience.
So this
multi-tenancy issue is, I think, an underrated problem for people who work on
databases. It is supremely difficult to guarantee fairness across all these
sub-dimensions, like fair scheduling. Of course, the Linux kernel scheduler does
a pretty good job of balancing things between tasks, but it doesn't know or
care that, say, Kaivalya's queries have consumed 99% of all query time over the
past 24 hours, and all of Simon's queries are slow as a result.
That's not a fair database experience, right? So you kind of have to go in front
of all of these resources that I mentioned and write your own kernel-like
scheduling layer in user space so that you know how much you can consume, what's fair,
and keep track of kind of the leaderboard of resource constraints and make sure
that this works out. At the same time too, because a multi-tenancy system has
almost unlimited resources, you may have gone away for the weekend, and for some
reason, your thing is doing 20,000 queries per second, and that's not what you
wanted to be billed for. So there are lots of nuances here in how to run a
successful multi-tenancy system, which could be a whole discussion in itself,
and we can go into more depth on that if you like.
Kaivalya Abte [59:45]:
I think that's a pretty difficult problem, and you know many databases do it.
They kind of do some calculations based on heuristics of what kind of workloads
a typical tenant has, and you know, probably pre-scale some computation if they
know this is the peak hours for these kind of tenants, so they have more
resources in place. It also depends on how they are charging the tenants. If
they are charging based on computation and they don't want to stop them,
probably they'll just increase the computation resources and then charge them
accordingly so it doesn't impact the other tenants, but they are also able to
leverage this increase in computation requirements by bumping up the computation
resources. I think that's a very important problem, as you highlighted. There's
no one solution for that; of course, we are going to start simple with
consistent hashing. Eventually, when we have some problems, we can do more. One
thing that we talked about, and you know, we also laughed a little bit about
that, is storing our files in JSON format, because it's not optimal,
right? What are the concerns or trade-offs there, and what are some other
storage formats available that you have explored while building turbopuffer?
Simon Eskildsen [1:00:58]:
Yeah, I think we, you know, JSON is always a great place to start when you're
exploring it.
Kaivalya Abte [1:01:04]:
Yeah.
Simon Eskildsen [1:01:04]:
We still use it for some of the metadata because it's very easy to inspect. It's
very simple to understand. It's very simple to map to any language, and it's
human-readable. So I wouldn't underestimate it. But the problem with JSON is
that it's very difficult to deserialize more than a couple hundred megabytes per
second. So you have to go through this whole deserialization process, so we can
find formats that have more efficient deserialization, right? Those exist, or
something like Protobufs, for example, which is much faster to deserialize
because there are just fewer constraints and fewer conditions that we have to check.
Kaivalya Abte [1:01:39]:
Yeah.
Simon Eskildsen [1:01:39]:
So that could be like kind of a v2; we could do Protobuf. There are a bunch of
open formats as well, right? So there's formats like Parquet, which is of course
one of the most known ones for columnar data. The problem with Parquet is that
it doesn't support random reads, so you can't be like, "I know that ID 15 is in
here; I just want to load that record." Parquet can't really do that because you
have to download the whole file.
Kaivalya Abte [1:02:04]:
Yep.
Simon Eskildsen [1:02:04]:
There are alternative formats that provide these kinds of things, right? So you
could go with one of these open formats. The other option that you can do, and
some of the open formats support this kind of thing, is that you can do kind of
zero copy into the data, right? So what zero copy means is that let's say you
download a file; it's one gigabyte big; it might be our B-tree file, except now
it's not JSON; it's a zero-copy binary blob. You
download it onto the machine, and you just get it in memory, and now your
machine is ready to use it. It doesn't have to interpret it or anything like
that; it just slams it straight into memory after checking that it's not some
malicious thing, and then you can start searching on it. That's the kind of
format that you have to do if you are serious about building a database where
you're using disk and especially data over the network, because you don't want to
spend all this time copying data. One thing is the time aspect, which might not
even be that long, but the CPU consumption means that the requests per second
you can do are quite low. What we use for turbopuffer is the Rust library rkyv, which
is very flexible in the types that we can express and how we can lay out
archives. We don't use any of the open formats because owning this format
entirely ourselves just allows us to move faster. We didn't find a format that
could do everything that we wanted, so we decided for the time being to do our
own format for the LSM that powers turbopuffer.
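turbopuffer does this with the Rust crate rkyv; as a language-neutral sketch of the zero-copy idea, here is a fixed-layout record file searched through a memoryview, so a lookup indexes straight into the downloaded bytes instead of deserializing the whole blob first. The record layout and function names are entirely made up for illustration.

```python
import struct

# Hypothetical fixed layout: each record is 16 bytes, a u64 id and a
# u64 value, little-endian, sorted by id. "Zero copy" here means we
# binary-search the raw buffer directly, never parsing it into objects.
RECORD = struct.Struct("<QQ")

def encode(records):
    """Serialize (id, value) pairs into the sorted fixed layout."""
    return b"".join(RECORD.pack(k, v) for k, v in sorted(records))

def lookup(buf, want_id):
    """Binary search straight over the bytes; no up-front decode."""
    view = memoryview(buf)
    lo, hi = 0, len(buf) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        k, v = RECORD.unpack_from(view, mid * RECORD.size)
        if k == want_id:
            return v
        if k < want_id:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Compare this with JSON: there, reading one record means deserializing everything before it; here, the CPU cost of a lookup is a handful of 16-byte reads regardless of file size.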
Kaivalya Abte [1:03:36]:
Very interesting. But did you explore the Arrow and Parquet formats as well? So
you mentioned that these open formats didn't work well for you. Is it just the
flexibility of the current format, or is there something else that you also, you
know, were concerned about?
Simon Eskildsen [1:03:52]:
Yeah, it really just comes down to the flexibility, right? I know exactly, and
the team knows exactly, how we want this data laid out on disk and the kinds of
data structures that we want implemented. Off the bat, we found things that we
didn't want to do, or wanted to do differently, in Arrow and Parquet, and
combining them looked awkward. So we made a decision early on to just do those
things ourselves. I think down the line that may
change. Maybe we'll open source the format of turbopuffer. Maybe we will allow
you to replicate from turbopuffer into an open format. Like we really have
nothing against the open format. It is just a question of how can we go fastest
with the most flexibility? And that's why we chose rkyv. With rkyv, you can
implement a completely custom data structure on top of it and get away with it.
We've utilized that quite a bit for some performance-sensitive features and so
on.
Kaivalya Abte [1:04:45]:
Okay. Let's take an example. So you mentioned you can implement a data structure
on top of it. Talking specifically about vector databases, there are specific
indexes required. Let's take an example and explain how this flexibility helps
you in turbopuffer.
Simon Eskildsen [1:05:00]:
Yeah, I think the flexibility is not so much about the ANN index (the
approximate nearest neighbor, or vector, index). It's more in how we design the
LSM and how we navigate all the compromises that we've talked about, right?
Kaivalya Abte [1:05:14]:
Okay.
Simon Eskildsen [1:05:14]:
We can completely control how things are laid out on disk, how many round trips
we can do, and all those types of things. I think some of those formats have not
been designed for minimizing round trips to object storage. We have to do a lot
of bit-twiddling and things like that to make sure we do as few as possible. We
just wanted ultimate control on that front. We can certainly talk a little bit
about how to do an ANN index on top of an LSM or on top of this rkyv-based
format if you want to get into that.
Kaivalya Abte [1:05:44]:
Absolutely. I mean, it deserves a separate episode because there's so much to
talk about. For now, I think we have covered the different aspects, challenges,
and trade-offs of building a database on object storage, and how to think about
building something like that from first principles.
Simon Eskildsen [1:06:05]:
Can I just add one thing that I think we didn't get into, but that I think is
important? A lot of people listening to this episode might be wondering why so
many object storage databases are appearing now, right? Object storage has been
around for a long time. So why is it happening now? There are three events that
I keep on my mental calendar and think a lot about, and I think it's important
for your listeners to understand them because they put in context why this is
such a big thing right now. The first one is that NVMe SSDs did not become
generally available on AWS until June of 2018. That's still relatively recent.
Kaivalya Abte [1:06:46]:
Yeah.
Simon Eskildsen [1:06:46]:
GCP had it in 2015. The second event, which I think most people also don't
know, is that S3 did not offer strong read-after-write consistency until
December of 2020. That is remarkably recent, right? That's basically 2021. It's
only like three years ago. Then I think the third thing is that S3 didn't have
some of the primitives we talked about for consensus, like conditional writes
(e.g. put-if-absent), until relatively recently, and those help you build these
systems, right?
So those are the events that have transpired to make this architecture
possible; it almost seems obvious now, right? Every innovation is like a new
lily pad appearing, and people hop further and further out, but I think those
are very, very important to keep in mind.
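To make the conditional-write primitive concrete, here is an in-memory model of the put-if-absent semantics Simon mentions (on real S3 this is a PUT with an `If-None-Match: *` header, which fails if the key already exists; the store and key names below are made up for illustration).

```rust
use std::collections::HashMap;

// Toy stand-in for an object store that supports put-if-absent. On S3, a
// losing writer would see HTTP 412 Precondition Failed instead of `false`.
struct FakeObjectStore {
    objects: HashMap<String, Vec<u8>>,
}

impl FakeObjectStore {
    fn new() -> Self {
        Self { objects: HashMap::new() }
    }

    // Returns true if we created the object, false if someone beat us to it.
    fn put_if_absent(&mut self, key: &str, body: &[u8]) -> bool {
        if self.objects.contains_key(key) {
            return false;
        }
        self.objects.insert(key.to_string(), body.to_vec());
        true
    }
}

fn main() {
    let mut store = FakeObjectStore::new();
    // Two nodes race to commit log entry 42; exactly one can win. That
    // "exactly one winner" guarantee is the building block for consensus
    // on top of object storage.
    let a = store.put_if_absent("wal/000042", b"node-a");
    let b = store.put_if_absent("wal/000042", b"node-b");
    assert!(a && !b);
    println!("node-a won");
}
```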
Kaivalya Abte [1:07:33]:
In the interest of time, we'll save the discussion areas we have left for
future episodes and go deeper into vector databases like turbopuffer there. I
enjoyed the discussion and learned quite a lot about how you think layer by
layer and reason about each of these trade-offs from first principles. So
thanks a lot for joining me today and shedding light on how one can approach
building a database on top of object storage. It was a great learning
experience, and I want to plan further episodes on the areas we left out of
this one. Thanks for joining; it was really amazing.
Simon Eskildsen [1:08:16]:
Thank you so much for having me on. There is nothing that I enjoy more than
building something up from first principles. So this was a treat.