Dmitry Kan [0:17]:
Today, as you were preparing your organic, high-mountain Taiwanese oolong in the
kitchenette, one of your lovely co-workers mentioned that they were looking at
adding more Redises because it was maxing out at 10,000 commands per second,
which they were trending aggressively towards. You asked them how they were
using it—were they running some obscure O(n) command? They'd used eBPF probes to
determine that it was all GET and SET. They also confirmed all the values
were about or less than 64 bytes. For those unfamiliar with Redis, it's a
single-threaded in-memory key-value store written in C. Unfazed after this
encounter, you walk to the window; you look out and sip your high-mountain
Taiwanese oolong. As you stare at yet another condominium building being built,
it hits you: 10,000 commands per second. 10,000. Isn't that abysmally low?
Shouldn't something that's fundamentally just doing random memory reads and
writes over an established TCP session be able to do more? Hello there, Vector
Podcast is back, season 4, and we are kicking off with an exciting topic and
guest, Simon Eskildsen, CEO of turbopuffer. I've been watching you guys from,
you know, almost from the start, just following each other on Twitter like
virtual friends. And it's funny that now you're the CEO of the company, and before this episode, you tried to sell turbopuffer to me and said,
"Hey, why don't you use it?"
Simon Eskildsen [2:06]:
It'll all come to pass. Yeah.
Dmitry Kan [2:08]:
Facts for sure. But tell me—hey, welcome. First of all, welcome, and thank you
very much for coming on.
Simon Eskildsen [2:15]:
Thank you.
Dmitry Kan [2:16]:
It's a tradition to usually start with the background. If you could speak in
your own words about yourself and your journey. I know that you've worked at
Shopify at some point, you know, also scaling databases, I guess. Right. But
I've also been following your napkin math newsletter. I've been reading it, and
maybe I'll quote some text from there today just to amuse and excite our
audience. But tell me about yourself.
Simon Eskildsen [2:46]:
Yeah, I can give a very brief overview and if we can dig into anything, if
there's anything that stands out. I started programming when I was a teenager.
Similar to you, English was not my first language, so at some point I exhausted
the Danish web and then like delved into video game addiction for three years as
a teenager to learn enough English to sort of, you know, get my own ChatGPT
moment and take off point. And then I spent a lot of time in high school being
not very good at competitive programming, but good enough to qualify for the
small country of Denmark. And then I spent almost a decade working at Shopify
doing mainly infrastructure work. So when I joined Shopify and the
infrastructure team, we were doing, I mean, it was not even an infrastructure
team, like DevOps was just becoming a thing. And we were driving just a couple
hundred requests per second. And by the time I left, we saw peaks of more than a
million. And I've more or less worked on all of the stateful systems that power
that because they generally tend to be the bottleneck, just playing whack-a-mole
every single year for every Black Friday for many years. And I spent the
majority of those years on one of the last resort pagers for Shopify as well.
Those pager shifts in the middle of the night were very scary, and a lot of
money, of course, runs through Shopify, so there was very high responsibility in that.
I left in 2021 and kind of jumped around my friends' companies, helping them
with various things. And I'd spent almost my entire career at one company. So I
wanted to dabble and just, you know, go and basically help my friends with any
infrastructure challenges they had. And in 2023, when ChatGPT launched and the
APIs launched, I was working with my friends at this company called Readwise.
They have a product similar to Pocket and others for reading articles later, a
phenomenal product. They asked me to build a recommendation feature for
articles. And it's like, well, that's perfect, right? Embedding models are
basically just LLMs with their heads chopped off, trained on exactly this data.
So we built something and it actually worked pretty well for
just recommending articles. But then I ran the back of the envelope math on what
it would cost to do this for the entire article catalog, right? Hundreds of
millions of articles. And it would have cost more than 30 grand a month to do.
And for a large company, that's not a big deal for an experiment, but this was a
company that was spending three grand a month on a Postgres instance that I had
tuned prior to working on this, and spending 10 times that on just recommendations
and possibly search was untenable. So it sort of lost traction, and it was
a bit sad. And it sort of ended up in that bucket that a lot of companies have
of like, okay, we're going to work on this when it becomes cheaper and then
we'll ship this feature. But it was a bit sad because I was excited about this
feature as a user of the product as well. And I could not stop thinking about
that. Why was it so expensive? And the vector databases at the time were storing
everything in memory. And DRAM on a cloud costs somewhere between $2 to $5 per
gigabyte. And this just, the economics of this didn't line up. It wasn't that
this vector database was doing anything, you know, malicious in their pricing.
They're just trying to earn an honest margin on memory pricing, but memory
pricing was just too high and it stopped this feature in its tracks. And what I
couldn't stop thinking about is why can't we do all of this on top of object
storage, right? Like we just put it all on object storage, that's the source of
truth. And then when we actually need some piece of data, we put it in memory or
even on disk if we can. And I did the math on that and I was like, I think
that's about 100 times cheaper. And of course, that would have been a no-brainer
for Readwise. We could have just bought it and started using it and tried it
out, right? And maybe put way more data in and maybe worked our way up to that
30 grand a month bill, but with a different workload. And so, yeah, I couldn't
stop thinking about it and then eventually started writing the first version
over the summer of 2023, just me alone in the woods of Canada and then launched
it in October of 2023, which is probably where you saw it. I didn't really tell
anyone about it. I was just hacking away. Launched it, did a lot of R&D over
that summer, insights that some of them still are in the product and a lot of
them we've since phased out. But the most important thing was that it launched
and the first version of turbopuffer, I was just looking at the website the
other day for an unrelated reason, didn't have mutable indexes, so you just
wrote to it and then you called an index endpoint and then you were locked in,
like that's it. And it didn't have any SDKs, it was just a big, you
know, pure HTML website, but it was enough to ship it and it caught the
attention at the time of the Cursor team back in 2023. And of course, this was
early on for Cursor, it was early on for us, and their vector database build did
not line up with their per-user economics and how they wanted to use RAG in
Cursor. And so they wanted to try to work together, and we exchanged a bunch of
emails of bullet points, and it was very clear that they thought that this
architecture was exactly right. Knowing the team now, they would have just sat
down at the dining table, done the napkin math over there and then thought,
why hasn't anyone built it like this? And so I went to San Francisco
and spent some time with them and came up with a bunch of features that they
would need and called the best engineer that I knew at Shopify, my co-founder
Justine, and asked if they'd come on board because I think maybe there's
something here. And yeah, we launched it. Cursor moved over and their bill was
reduced by 95%, and of course the traditional storage architecture they were on
before didn't make sense for the Cursor economics, but our storage architecture
really did because you put all the codebase embeddings on S3 and then the ones
that are actively being used, we can use in RAM or have in disk. I'll stop
there, but that would be what led up to this moment.
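[Editor's note: the "about 100 times cheaper" intuition can be checked directly from the per-gigabyte figures mentioned here; a tiny sketch using those numbers.]

```python
# The DRAM-vs-object-storage gap Simon describes, per gigabyte-month.
dram_per_gb = (2.0, 5.0)   # "$2 to $5 per gigabyte" of DRAM on a cloud
s3_per_gb = 0.02           # ~2 cents per gigabyte on object storage
print([d / s3_per_gb for d in dram_per_gb])   # [100.0, 250.0] -> roughly 100x or more
```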
Dmitry Kan [9:22]:
Oh, that's an amazing journey. A lot to ask, of course, a lot of questions, but
just on that Cursor thing, as I told you before we started recording, you know,
and then I've listened to the Lex Fridman podcast episode with the Cursor team
and they did mention turbopuffer sort of like in passing, but you know, I think
that also probably created a lot of attention to you guys. But I'm just curious,
like how did you get together? How did you know the Cursor team? Did you
somehow know someone on the Cursor team that you could partner with early on?
Essentially, you know, they kind of helped you to pioneer it, right? In some
sense, becoming the first client or maybe a future client, right? How did you
approach them?
Simon Eskildsen [10:13]:
They did. I mean, they were a design partner in every sense of the word, right?
We had a Slack channel, and I feel like they treated us as part of their team,
and we treated them as part of our team. They came inbound. They sent an email
based on the website, and they said, "Hey, we would need mutable indexes and
glob and a couple of other things." So it's like, well, those are very
reasonable requests, right? And I think they had the conviction that this was
the right architecture. I guess if we could prove worthy of that trust, then you
would be in a good place. So it was really just an honest conversation, just the way that
the website is today, a very honest description of what are the trade-offs, what
can it do, what can it not do, what is the latency profile, what are the
guarantees, and that's exactly the kind of bullet point discussion that we
engaged in over email before I met the team in person. Yeah, and they, of
course, were a small team at the time, right? And they needed help with parts of
their infrastructure, and with working very, very closely with teams that they
could trust, with the right economics and the right reliability.
Dmitry Kan [11:24]:
Yeah, for sure. But I guess that honesty, which I also value a lot, you know, in
my work as I became a product manager, you know, three years ago, and I think it
applies to any discipline: be honest. But you know, that honesty probably rests
on the fact that you'd done your napkin math and you knew where this would
scale, how this could go, right? How did you go about doing that pre-launch,
right, before having any client? Was it at your friends' company that you
figured out the economics and sort of the throughput and all of these rigorous
questions that you ask, you know, as problem statements in napkin math?
Simon Eskildsen [12:04]:
I think I should almost bring up the Internet Archive version of it. The
first version of turbopuffer, I had not thought about the business at all. I
didn't have any launch playbook. I had run, of course, all the economics of what
it would cost me to operate and spent a decent amount of time on the pricing
because that felt like an important thing to spend time on at the time. But
there was really not much more than that. Of course, the Readwise team was very
interested, but at the time I could barely do it, you know, I could just do
around 10 million vectors, which is not enough for their use case. I can screen
share the website with you right here, showing what it looked like at the time,
and then, for the listening audience, we can get your reaction. But it was
very simple. I wouldn't put any sophistication in it. It was honestly, I was
exhausted. I'd been working on this completely alone, not telling anyone about
it, with no interested customers, for like four months, extremely focused, like
every single day. And I couldn't, like if you ask my wife, she'd say I was very
distracted and she's just like, "Why are you working so hard on this? Like
there's no one on your team, you don't have any customer lineup." And I'm just
like, "Someone has to do this." And I just launched it. I mean, now it feels
embarrassing when we did launch it, just couldn't do that much. It was pretty
slow. I spent a bunch of time actually trying to make it work in Wasm and on the
edge, but it was too hard to make it fast, and a bunch of other false starts
like that on different types of ANN indexing structures we could talk about that
as well and what we settled on. But there was no real sophistication in the
go-to-market. It was really just here it is, here's the napkin math, here's what
it does, let's see how the world takes it.
Dmitry Kan [13:53]:
Yeah. But I see, I think when you sit on that, well, you didn't sit on it yet, but you
had a cool technology idea in mind, right? You knew, you know, it may play out,
but also of course it required a lot of hard work, like you said. But after
that, after you see it fly like on some small scale or whatever scale, I think
that brings you like that excitement to bring it to the world, right? So yeah, I
see you sharing the screen of the web archive page.
Simon Eskildsen [14:32]:
Yeah, that's it. Very simple.
Dmitry Kan [14:35]:
Yeah, that's awesome. But yeah, that's actually a good segue to... You know, you
probably know I was there at the emergence of the vector database field. I
think I was probably the first to write just a simple blog post with like, you
know, these short snippets of what each vector database did and how they stood
out and so on. turbopuffer wasn't there because turbopuffer was
still in your mind, I think. But the segue here is I don't have it covered in
that blog post, but in your mind, why were you not happy with the vector
databases like at large? Did you try all of them? Did you try some of them? Why
did you think that a new vector database deserves to exist?
Simon Eskildsen [15:28]:
Yeah, I think it really just came back to the Readwise example, right? They
looked like great products. I really liked the API of many of them. They had
lots of features that it would take me a long time to build, even features that
we don't have today, although we have a lot of features now compared to when we
launched. It really came out of the cost piece: it felt that there was a lot of
latent demand built up in the market of people who wanted to use these things,
but it just didn't make sense with the economics. It's very difficult to
earn a return on search. I mean, I remember the search clusters at Shopify were
very expensive, but e-commerce is a lot about search, so it was okay, right? But
for a lot of companies, search is an important feature, but it's not the
feature, right? And so the per-user economics just have to make sense. It's not
that everyone just wants it in the cheapest possible way, it's that if you
invest in infrastructure, you have to get a return on that investment. And it
felt that I knew that at Readwise they could get a return on that investment,
but it wasn't at 30 grand a month; it was maybe closer to three grand or five
grand a month that they would feel that they could earn a return on that feature
in terms of conversion, engagement, and whatever. So it was really about the
storage architecture. And I think that when I think about databases now, this
was not as coherent to me at the time. At the time, I was driven by the napkin
math, not the market, nothing else. It was based on one qualitative experience
and the napkin math. There was nothing else in it. I can speak about it in a more
sophisticated way now, having learned a lot about go-to-market since, but that's
really all it was at the time. It was an insight on those two things. The best
ideas, right, are simultaneous inventions, right? Someone else would have done
it six months later; probably other people were working on it at the time who
launched later, right? We were the first to launch with this particular
architecture, but it was out there for the grabbing, right?
Dmitry Kan [17:28]:
Yeah.
Simon Eskildsen [17:28]:
The idea was in the air, like S3 had the APIs now finally. So the way that I
think about this, to really boil this down, is that if you want to create a
generational database company, I think you need two things. You need a new
workload. The new workload here is that almost every company on earth sits on a
treasure trove of data and they want to connect that to LLMs, especially all the
unstructured data, which has always been very difficult to do.
We did this for structured data in the 2010s. The new workload was that we
wanted to do analytics on billions, tens of billions, trillions of rows of
structured data. But now with LLMs, we're entering into that with the
unstructured data. That's the first thing. We need a new workload because that's
when people go out shopping for a new database. The second thing that you need
is a new storage architecture. If you don't have a new storage architecture that
is fundamentally a better trade-off for the particular workload, then the
natural move is to tack a secondary index onto your relational database, OLAP
stack, or existing search engine. I would have made that decision in the shoes
at Shopify, right? It's like, well, this database has a really good vector
index, but it doesn't bring anything new in terms of the storage architecture,
so we're just going to invest in the MySQL extension, right? That's just what we
would have done at Shopify—same thought process, right?
Dmitry Kan [19:01]:
Mm-hmm.
Simon Eskildsen [19:01]:
These are great databases. They've stood the test of time, and when you're on
call, you become very conservative in what you adopt for new workloads. But you
cannot ignore a new storage architecture that is an order of magnitude cheaper
than the previous one. When you store a gigabyte of data in a traditional
storage engine, you have to replicate that to three disks, maybe two if you have
a little more risk tolerance, but likely three. A gigabyte
of disk from the cloud vendors costs about 10 cents. You run it at 50%
utilization, otherwise it's too scary to be on call, 20 cents per gigabyte.
Times three for all the replicas, 60 cents per gigabyte. Object storage is two
cents per gigabyte, right?
Dmitry Kan [19:43]:
Yeah.
Simon Eskildsen [19:43]:
It's 30 times cheaper. If it's all cold, now by the time you have some of it in
SSD and you have it in memory, then the blended cost ends up being different,
but it tracks the actual value to the customer. Even if you have all of that in
disk, well, you only need one copy, right? And that disk you can run at 100%
utilization, meaning the blended cost is now 12 cents per gigabyte, right? So
the 10 cents, 100% utilization plus the two cents per gigabyte for object
storage. So now you have the ingredients of a new actual database. You have a
new workload, right? Which means that people are out there trying to look for
ways to connect their data to LLMs, and then you have the second ingredient,
which is a new storage architecture that allows them to do it an order of
magnitude easier and cheaper than what they can do on their existing
architectures. And this matters because vectors are so big, right?
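[Editor's note: the storage-cost arithmetic from this exchange, written out. The figures are the ones quoted in the conversation, rounded; real cloud pricing varies.]

```python
disk_per_gb = 0.10      # ~10 cents/GB-month for a cloud block-storage disk
replicas = 3            # replicate to three disks for durability
utilization = 0.5       # run disks half full so on-call stays sane

traditional = disk_per_gb / utilization * replicas   # 0.60 -> 60 cents per GB
object_storage = 0.02                                # 2 cents per GB on S3/GCS
print(traditional / object_storage)                  # 30x cheaper when fully cold

# Blended: one disk copy run at 100% utilization as cache, plus the S3 copy.
blended = disk_per_gb / 1.0 + object_storage         # 0.12 -> 12 cents per GB
```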
Dmitry Kan [20:34]:
Yep.
Simon Eskildsen [20:34]:
A kilobyte of text easily turns into tens of kilobytes of vector data.
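[Editor's note: a worked example of why a kilobyte of text balloons into tens of kilobytes of vectors. The chunking and embedding dimensions are assumed, typical values.]

```python
text_bytes = 1_024
chunks = 4                          # overlapping chunks of a few hundred characters each
dims, bytes_per_float = 1_536, 4    # e.g. a 1536-dim float32 embedding per chunk

vector_bytes = chunks * dims * bytes_per_float
print(vector_bytes, vector_bytes / text_bytes)   # 24576 bytes, ~24x the source text
```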
Dmitry Kan [20:38]:
Yeah, yeah, that's absolutely true. One other thing that I kept hearing about,
you know, on whether or not to introduce vector search into the mix for some
really heavy workloads, is that it will bring certain latency on top
that we cannot tolerate, right? For example, if you run a hybrid search like you
guys have implemented as well, you know, one of these will be the slowest and
therefore you will have to wait for that slowest component. And so if it adds, I
don't know, a few hundred milliseconds on top of your original, you know,
retrieval mechanism, then it's going to be a non-starter. What's your take on
that? Have you thought, obviously you have thought about that. What's the edge
that turbopuffer brings in this space over maybe pure databases?
Simon Eskildsen [21:30]:
Yeah, I think there's two types of ways that people adopt vector databases or
turbopuffer. We don't consider turbopuffer a pure play vector database. We
consider it a search engine. We actually consider it a full database because
there's a full generic LSM underneath all of that. And we consider that the
actual asset of turbopuffer is an LSM that's object-storage native and doesn't
rely on any local state. We just think that the vector index and the search engine
index is what the market needed the most. So let's speak about latency. There is
no real fundamental latency trade-off with this architecture. The only thing is
that once in a while you will hit that cold query, but the entire database is
optimized around minimizing the amount of round trips that you do to S3. S3, you
can max out a network card, right? So you can get on a GCP or your AWS function,
get 50 to 100 gigabits per second of network bandwidth—not gigabytes per second
of network bandwidth. So this is similar to disk bandwidth, but the latency is
actually even better in the clouds often than disks, even with SSDs, even than
NVMe SSDs. So the network is phenomenal. You can drive, say, you can drive all
of that data, you can drive gigabytes of data per second in a single RAM strip.
So you can get greater throughput, but the latency is high. The p90 might be
around 200 milliseconds to S3 for every round trip, somewhat regardless of how
much data that you transfer, assuming you're saturating the box. We've designed
almost everything in turbopuffer around minimizing the number of round trips to
three to four. That doesn't just help for S3, it also helps for modern disks,
where it's the same thing: you can drive enormous amounts of bandwidth, but the
round-trip time is long, right? It's like hundreds of microseconds instead of
hundreds of milliseconds, but still substantial compared to
DRAM. The latency trade-off is not a fundamental trade-off with this
architecture. By the time that it makes it into the memory cache, it's just as
fast as everyone else. We have found that people don't care if it's like a
millisecond or five milliseconds. As long as it's reliably less than around 50
milliseconds, they're good, right? And I think that a lot of the traditional
storage architectures, especially because of the sharding structure with
multiple nodes, you're already in a worse position than going to two systems:
if you run a query on some of the traditional search engines, you generally
touch five, ten, maybe more nodes, because the shard size is very, very small.
We could go into more depth on that, but you already have this fan-out problem.
What we see is that there's two types of ways that people adopt
it. So the first one is you have an existing lexical search engine. You are
having a hard time running it because of this traditional, like very stateful
architecture, and they're reputed for just being difficult to run. And you're
like already a little bit at your threshold for the amount of money that you're
spending on this cluster. And if you put the vector data in, it's often 10 to 20
times larger than the text data. It's just a project that stops in its tracks,
similar to the Readwise case that I mentioned before. So for those
players, we often see that they have something that's really well-tuned for the
lexical and they adopt a vector store, and then they do two queries in parallel.
The vector store should not be slower than the lexical, right? So these are just
two futures that you merge together in userland. And in general, we see that our
customers are actually quite happy to move some of the ranking and the final,
like second stage ranking out of the search engine and into a search.py instead
of a big search.json, which can be very difficult to maintain. Many of these
companies express a lot of desire to move more and more of their lexical work
also onto turbopuffer, and we have a full-text search engine. We don't have
every feature of Lucene yet, but we're working very, very actively on bringing
this up. What we also see is that a lot of our customers don't need all of the
features of Lucene anymore because the vectors are so good that a lot of the,
you know, PhD-level efforts we did before to turn strings into things are not as
much of an issue anymore. And really what we use strings for now is that when
you search for "Dmitry Kan," you get "Dmitry Kan," right? Like for a prefix
match, whereas an embedding model might think you're talking about something
else entirely. Those kinds of things are important, and we still need string
matching for that. Lots of applications need it, but there's a lot of things
that we do in Lucene with synonyms, with stemming, with all these kinds of
things that the models are frankly just a lot better at. So we find that this is
an adoption curve that is there. A lot of the newer companies just start with
embedding models and simple full-text search, and they get it up and running on
turbopuffer, and they like that. They just pay for what they need, they don't
think about it, and they could pump a petabyte of data in if they wanted, and it
would be extremely competitive on pricing, and they don't have to think about
it.
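[Editor's note: a minimal sketch of the "two queries in parallel, merged in userland" pattern described above, the kind of thing that might live in a search.py. The lexical_search and vector_search helpers are hypothetical placeholders, not a real turbopuffer or Lucene client, and reciprocal rank fusion is just one common way to merge the two rankings.]

```python
import asyncio

async def lexical_search(query: str, k: int) -> list[str]:
    # Placeholder: call the existing lexical engine, return doc ids ranked by BM25.
    return [f"lex-{i}" for i in range(k)]

async def vector_search(query: str, k: int) -> list[str]:
    # Placeholder: embed the query and ask the vector store for the k nearest ids.
    return [f"vec-{i}" for i in range(k)]

def rrf_merge(*rankings: list[str], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: a simple, tuning-free way to merge two rankings.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

async def hybrid_search(query: str, k: int = 10) -> list[str]:
    # Two futures resolved in parallel; neither query waits on the other.
    lexical, vector = await asyncio.gather(
        lexical_search(query, k), vector_search(query, k)
    )
    return rrf_merge(lexical, vector)[:k]

print(asyncio.run(hybrid_search("dmitry kan")))
```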
Dmitry Kan [26:26]:
Oh, that's awesome. That's awesome. Actually, I forgot to mention, I forgot to
ask you, which language did you choose to implement turbopuffer?
Simon Eskildsen [26:34]:
Yeah, we... Well, it was just me at the time, but I chose Rust.
Dmitry Kan [26:40]:
Mm.
Simon Eskildsen [26:40]:
And I think I'd spent the majority of my career writing Ruby at Shopify and then
a lot of Go as well for some of the infrastructure components. And then mainly
debugging and reading C, which is what all the databases that we were using were
written in. I really like Go. Go sat alongside Ruby at Shopify because Go was
one of those things where, when leading teams, I didn't have to worry about
whether someone knew Go or not, because the ramp-up to learn it is two weeks.
The ramp-up to learn Rust and be proficient in it is months, right? And
someone that's written Rust for two years is a lot more productive than someone
who's written it for two months in the language. And that's just not the case
for Go. Like someone who's spent two years in it is just not that much more
productive. I think that's an amazing feature of the language. From my own point
of view and from the napkin math point of view, having been inside of runtimes,
the Ruby MRI runtime and the Go runtime, I was just hungry to get directly
connected to the metal of the machine. And for a database in particular, that
was very important, right? We need to vectorize everything. We need full control
over that. And as remarkable as Go is, and I think it would have been okay, that
raw access to the machine has been needed for writing something like turbopuffer.
Dmitry Kan [28:11]:
Yeah, yeah, for sure. I still remember the times when I was learning and coding
industrially in C and C++ like you. You really needed to be very, very careful,
but in return you could get a lot of performance gains, you know, and some of
your ideas really fly. But yeah, today I guess I'm coding more
in Python or should I even say that I code in Python when I use Cursor more and
more, which is, by the way, scary, you know, the feeling when some other entity
writes code and you are just reading it, right? It's a little bit scary and I'm
still grappling with it, but the amount of productivity that I get is enormous
and it's like, you know, I can ship daily features and just see them being used.
That's amazing.
Simon Eskildsen [29:02]:
I think what I love about it is that I still love to sit there and write the
artisanal code by hand. You know, maybe at some point we will mark turbopuffer
as an artisanally written database because we don't use a ton of AI for the very
key parts because, I mean, we're at the edge of what the LLMs could know. But I
think that for me, in a position where I'm in and out of meetings all day these
days, I can actually get a lot done in a 30-minute window when I have something
that's prompting and writing the tests, right? You kick something off at the
beginning of a meeting, you check in on it in the, you know, 15, 30 minutes you
have in between blocks, and this allows me to actually contribute a lot more
code than I was otherwise going to be able to. Not into the core engine; you
know, I don't get let into a lot of that anymore because I don't have the time
and focus that it takes to fully think something through there. But for
the website, the API, tangential features, all of that, it's just been
wonderful.
Dmitry Kan [30:01]:
Yeah, that's amazing. I also wanted to go a bit on a tangent. You're
essentially, you could say, a mathematician engineer, but you took a leap
towards becoming a CEO, right? And I think, you know, as you said, you go to
meetings, you do lots of, you know, probably sales and product and all of that
stuff. Was it a natural transition for you? What have you learned in this
journey, and what maybe do you miss from your previous career when you were,
you know, hands-on and sat down and wrote a bunch of code?
Simon Eskildsen [30:43]:
I think I have a couple of angles to answer the question, but not necessarily a
direct answer. I think one angle is that fundamentally I'm like a growth junkie
for better or worse. And I think that entrepreneurship is the ultimate path for
a growth junkie. It was never really something that I assumed that I was going
to do. Even when I was working on the project, it was never about becoming a
founder; it was just about creating the database, right? And at
some point, becoming the founder of the company becomes a means to an end of
creating the database and getting it into the hands of our users and making sure
they have a great time. That's always what drove me, right? It was: Readwise
should have this, right? Our customers should have this. They should have a
great experience. And that's always what's driven me. And to me, the founder and
all of the other things have been a means towards an end there. I think that one
of the things that is maybe both controversial but also feels like a true
statement is that at some point I became a bit numb to what work I enjoy and
what I don't enjoy anymore. Because what I enjoy the most is making this company
successful and making the database successful for our customers. That's what I
care the most about. And I'm, yeah, I honestly, I love sales. I love marketing.
I love the engineering. I love hiring people for the team. I love all of these
things. But it's not a simplistic answer to, oh, I've been coding my whole life.
I think it's more that that is my idle activity. If there is a one-to-two-hour
window and there's nothing urgent on, then I'm going to go spend some time in the code
base. It's like, oh, how did Nathan implement this new query-planning heuristic?
That's my idle activity, and when interviewing people I also always try to
understand, especially if they're in a more hybrid role, what's your idle
activity? What's the thing that you do when you have one to two hours and
nothing else comes up? Do you gravitate towards the code? Do you start looking
at, do you start writing an article? Do you start playing with the product? What
is that idle activity? And it is code for me. That's what everything is grounded
in. And I think it has a deep influence on how I can lead the company. I often
think about something that Taleb said, you know, the author of "Antifragile"
and a bunch of other books: the best authors of books are not the ones that sit
down and, like, you know, read a bunch of papers, then write a page, then read
another paper, write a
page. The best books are written by people who just, you know, go to a cabin and
sit down, write 500 pages and hit publish. Of course, that's not what actually
happens. But if you read the books, it's probably pretty close to what actually
happened. And he just has the citations in his head. And I think about that
often when building this company, that it has felt like I've worked for this my
whole life without knowing it. And I feel every morning that I wake up that this
is exactly what it has led up to. So it's very natural, even if it wasn't a goal
unto itself, that it makes sense with the experience I've had to do exactly
this. And I tremendously enjoy it, but it's not a simplistic answer to do I miss
coding.
Dmitry Kan [33:52]:
No, no.
Simon Eskildsen [33:52]:
I want to make this company incredibly successful, but sometimes I will do it as
a recreational activity.
Dmitry Kan [33:59]:
Yeah, I mean, definitely like when I look at you, like on Twitter, for example,
you come across as a very technical person and you are for sure, right? Even
though in order to grow your business, you need to do a lot of other activities.
But at the same time, I mean, yeah, I don't mean to ask it in a way that, hey,
you regret now that you do sales, you regret not doing more coding, which is not
true. You still do that. And I think that all of the engineers will become
better engineers if they learn the mastery of actually presenting what they do,
right? And then they will not need a middle layer or someone else who will go
and talk to that product manager or whoever else they need to talk to, right? So
they can actually represent themselves. But I also loved how you put it really
eloquently: what is your idle activity, right? What's your affinity, what do you
gravitate to? And it resonates a lot with me, because my idle activity, when I'm
really nervous that I'm doing nothing, especially on vacations, is that I start
coding, you know, I just go, okay, let's just hypothesize about something. But
let's dial back to the architecture. Like when
I look at the architecture page of turbopuffer, it's very simple. It's like
client connecting over, you know, TCP to a database instance and it has just two
components there, memory or SSD cache and the object storage. Tell me a bit
more, so I think our listeners and I mostly know what object storage is, but
tell me a bit more about that memory component, like what algorithm design went
into that, maybe trade-offs and, you know, how frequently you need to do the
round trips to the object storage versus when you actually don't do that.
Simon Eskildsen [35:51]:
Yeah, I think it would be easiest to do this by speaking about the lifetime of a
request as the cache warms up. So we actually start with the write path. And
when you do a write into turbopuffer, it's as simple as you can imagine it. I
mean, at this point, we've optimized parts of it so that it's not quite this
simple anymore, but this is the best way to explain it. When you do a write to turbopuffer, that
write basically goes into a file in a directory called the write-ahead log. So
when you write to a namespace, you can imagine that on S3, it's like slash
namespace, slash, you know, write-ahead log. The write-ahead log is basically
just a sequence of all the writes in order, the raw writes. So you do your
write, and it might be, okay, I'm inserting a document with text "Dmitry Kan"
and one with text "Simon," and those are the two documents. In the simplest way,
you can imagine that this file is called 0.json and the next one is called
1.json, 2.json, and so on. That's a database, right? That's just a write-ahead log. And if
you want to satisfy a query, you just scan through all the JSON documents and
you satisfy the query. That's actually a respectable database, and it's not even
that far from the first version of turbopuffer, but of course you have to index
that data as well. So basically, as you can imagine, once many megabytes of data
come in, asynchronously an indexing node will pick it up and put it into the
inverted index for full-text search, into a filtering index for other attributes,
and there will be other indexing types in the future. When that happens, it will
put it into slash namespace slash index and just start putting files in there,
right? And then the query layer can then consult those files, right? Instead of
scanning through every single document to find "Dmitry Kan," you can just plop
in and look at "Dmitry Kan" in the inverted index, find the document, and return
it. That's how a write works. When a write happens, it will go through one of the
query nodes, and the write will also be written into the cache, right? So both
the memory cache and the disk cache. So when you do a query, you will go to that
same query node, right? There's consistent hashing, so if there are three nodes,
the same namespace will end up on node one all the time if it hashes to that
node. When you do a query, it will first check the caches. If you just did the
write, well, it's already there, because we just wrote all the writes into the
cache to have this, you know, write-through cache, and we will satisfy the query
mainly from the cache. If for whatever reason this namespace is not cached,
maybe you did the write a month ago and
so it's falling out of cache and you do the read, well then we'll read through
cache by going directly to object storage with as few round trips as possible to
get the data to satisfy the query, both from the index and from the WAL. We'll
do range reads directly on S3, right? The old like HTTP range header to get
exactly the bytes we need to satisfy the query and then start hydrating the
cache on the query node so that subsequent queries get faster and faster. And we
can do that at gigabytes per second. We can hydrate the cache even for very,
very, very large namespaces. So that's the general architecture of turbopuffer.
On a completely cold query, it takes hundreds of milliseconds, and on a warm
query, it can take as little as 10 milliseconds to satisfy the query. The last
detail I'll point out, and then we can go into a particular aspect of this, is
that turbopuffer has chosen to do consistent reads by default. This is an
unusual choice for search engines. Lucene doesn't do this unless you turn it on
explicitly. I think they've done more work now for real-time indexing, which to
me is the gold standard, which is why I keep referring back to it. It's a
phenomenal piece of software. And turbopuffer has consistent reads by default,
meaning that if you do a write and then you read immediately afterwards, that
write will be visible. And in order to satisfy that, we can't just rely on the
cache on that node. That node could have died, it could have, you know, the
hashing could have moved because we scaled up. So every single query, we go to
object storage and see what is the latest entry in the WAL and do we have that
entry, right? Is it at 3.json or is it 5.json and do I have that? So we have a
little pointer file that we can look, we can download and look at, right? And
that round trip is basically our p50, like our spans are basically, you know,
often like one to two milliseconds of actual search and then on GCS, depending
on the region, 12 to 16 milliseconds waiting for that consistency check against
object storage. The small object latency is a little bit better, so it's eight
milliseconds. But you can turn this off and get eventual consistency, which is
very normal for these databases, like it could be up to one minute out of date,
and then you can often see a millisecond or less of latency observable from
turbopuffer by turning off that check. But we find that this is a very safe
default, and I think that
databases should ship with very safe and unsurprising defaults.
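[Editor's note: a toy sketch of the object-storage write-ahead log and consistency check described above: numbered JSON files under a namespace prefix, plus one small pointer read per query. The bucket name, key layout, and pointer file are illustrative assumptions, not turbopuffer's actual format.]

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-puffer-bucket"   # hypothetical bucket

def append_wal(namespace: str, docs: list[dict], seq: int) -> None:
    # 0.json, 1.json, 2.json, ... -- the raw writes, in order.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{namespace}/wal/{seq}.json",
        Body=json.dumps(docs).encode(),
    )
    # A tiny pointer object recording the latest committed sequence number.
    s3.put_object(Bucket=BUCKET, Key=f"{namespace}/wal/HEAD", Body=str(seq).encode())

def consistent_query(namespace: str, cache: dict[int, list[dict]]) -> list[dict]:
    # Consistent read: one small round trip to learn how far the WAL has advanced,
    # then fetch only the entries the local cache is missing (range reads against
    # index files in the real system; whole WAL objects here for brevity).
    head_obj = s3.get_object(Bucket=BUCKET, Key=f"{namespace}/wal/HEAD")
    head = int(head_obj["Body"].read())
    for seq in range(head + 1):
        if seq not in cache:
            body = s3.get_object(Bucket=BUCKET, Key=f"{namespace}/wal/{seq}.json")["Body"]
            cache[seq] = json.loads(body.read())
    return [doc for entry in cache.values() for doc in entry]
```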
Dmitry Kan [41:00]:
Yeah, for sure, for sure. So in that cache, and let's focus only on the vector
search part for now, you also have the ANN index. Is that also stored on S3, and
do you also keep kind of like a replica of it in memory for quick access? And if
that's true, how do you sort of synchronize the two?
Simon Eskildsen [41:26]:
Both the write-ahead log and the index are, everything is stored on S3. If you
killed all of the compute nodes of turbopuffer in all of our clusters, we would
not lose any data. There is no data on the compute nodes that matter. It's only
transient caching. But we cache everything. Yeah, if you're accessing the index,
we'll cache the index. If you're just accessing the write-ahead log files,
because the namespace is so small or there are parts of the data that haven't
been indexed yet, then that's also on S3 and goes into the same cache with everything else, right?
Prioritized by the workload to try to get the best performance possible.
Dmitry Kan [42:01]:
Yeah, it's quite smart. So effectively... I remember at some previous companies,
when I was running Apache Solr, one of the problems was
always that all of these shards are super cold because they're never used,
right? We still pay for them. But then when the query hits, you incur so much
latency that it's super painful. And so I was always coming up with these ideas,
what if I run some, you know, post-indexing warm-up script that will go and
shoot a bunch of queries to all of the shards just to keep them, you know, up
and running and warm or just cat all the indices on Linux into memory? We've
done that too. That was like 10 years ago, so that was a very strange feeling,
like why do I need to mess with that level of detail? It never actually paid
off. I think what pays off is the smartest way to organize your index and how
you read data backwards. Like essentially when your users really only need fresh
data first, like on Twitter, for example, everyone is really after the recent
tweets and not some archive. And that was a very similar case for us. But
it's very interesting, like you go into so much detail there to make the
database effectively like a living organism, you know, adjusting to the usage.
But you also have multi-tenancy, right? So meaning that the same turbopuffer
deployed across the data centers is going to be used by multiple companies at
the same time unless they demand isolation. How do you think about that when
they use the same, effectively the same instance, compute and index?
Simon Eskildsen [43:50]:
I'd love to go into the Solr example for just one second before we go into
multi-tenancy. How slow were those queries? Because when you say cold, you mean
that it's not in memory. When I say cold, I mean that it's on S3. What kind of
latency were you seeing?
Dmitry Kan [44:04]:
It was very slow. First of all, it also has to do with the domain specificity,
you know, the queries with Boolean clauses that were very long. And so just the
query itself would take a minute to execute on our original index design, and
that was just super crazy, right? But it was also very accurate, because it was
sentence-level search. And then I had to design a new system, a new
architecture, where we could retain the accuracy of that engine but not have to
spend so much money on indexing individual sentences, so we indexed one complete
document, right? I had to change the algorithm slightly, and so it went to
sub-second. It was still, I think, slow, right? But it was much faster, and we
could scale the company effectively after that, right? We went from one minute,
and 75% of infrastructure costs were, you know, shaved off. But that was part of
the Lucene work, you know, munging with the algorithm and changing how it scans
the document. It had nothing to do with the level that you go into, you know,
with turbopuffer, like effectively controlling the whole process there.
Simon Eskildsen [45:28]:
Got it. Yeah, I think the point there is that we do see some customers who are
concerned about the disk cache because they've gotten bitten before. The way
that I would think about it is that in some of the traditional engines, the way
that they do IO, if something is on disk, it feels like it's bad. Like if it's
on disk, it's slow, and it really has to be in memory. And so you sort of have,
you know, the pufferfish: when it's fully inflated, it's in DRAM, right? When
it's deflated, it's on S3. Whereas the traditional engines only had two
settings, right? Either it's on disk, which is quite slow. And frankly, in some
of the traditional storage engines, I've seen the latency on disk being similar
to our latency on S3.
Dmitry Kan [46:10]:
Yeah.
Simon Eskildsen [46:11]:
And so then you have to load it into DRAM. And what a lot of these traditional
databases, they have to do a full copy into DRAM. They can't just like zero copy
off of disk. And then the disks are also quite slow, these old network disks,
right? The NVMe disks are so fast, right? They can drive bandwidth that's
within, you know, a very low multiple of DRAM, right? Tens of gigabytes per
second. But even though the hardware is cheap, you still can't take advantage of
these very easily. You can't just put some software on it and have it be like 10
times faster than an older disk, even if it's fundamentally capable of it,
because what we found, for example, is that we had to remove the Linux page
cache, because the Linux page cache cannot keep up with these disks. So you have
to do direct IO, but when you do direct IO, you don't get coalescing, you don't
get all these other things. So now you have to write your own IO driver, right?
And so databases just have not been built to take advantage of it, because
they're also not built to drive a high IO depth, basically how many outstanding
IO requests they can have in flight. There's much more throughput to be had. So
there's just a lot of barriers to entry there. So what we find is that
when, again speaking in generic terms here of, you know, millions of vectors
queried at once, when something is on disk, it's maybe high tens of
milliseconds, you know, 50 to 70 milliseconds when it's fully on disk, maybe
lower depending on the query, the machine, or whatever. And when it's in memory,
it's closer to 10 to 20 milliseconds, right? This is not bad. Like the user is
barely going to notice it. But of course you're going to get more throughput
that way. And then when it's on S3, it's maybe more like five to six hundred
milliseconds, so users usually notice. But a
lot of our customers, like Notion, for example, when you open the Q&A dialogue
and these different dialogues that will query turbopuffer, they will send a
request to tell turbopuffer, "Hey, can you start warming up the cache here in a
way that makes sense?" And by cache, we just mean putting it into disk and
starting with sort of the upper layers of the ANN index and other things to
reduce the time as much as possible. So there's a lot of things that can be done
here that are very, very simple. Together, that means there's barely a
trade-off.
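[Editor's note: the page-cache point above in miniature. This Linux-only snippet shows what direct IO looks like and why it pushes the buffering burden onto the database; the file path is just an example.]

```python
import os, mmap

# O_DIRECT bypasses the kernel page cache, but it requires block-aligned offsets,
# lengths, and buffers -- which is why you end up writing your own IO layer to get
# coalescing and deep queues back.
fd = os.open("/var/cache/example/segment.bin", os.O_RDONLY | os.O_DIRECT)
buf = mmap.mmap(-1, 1 << 20)    # anonymous mmap is page-aligned, as O_DIRECT requires
n = os.preadv(fd, [buf], 0)     # read 1 MiB from offset 0, skipping the page cache
os.close(fd)
```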
Dmitry Kan [48:26]:
Yeah.
Simon Eskildsen [48:26]:
Let's go back into multi-tenancy, unless you had a follow-up on this.
Dmitry Kan [48:30]:
Yes, let's do that. Like, how do you view multi-tenancy?
Simon Eskildsen [48:34]:
So turbopuffer can run in three different ways. It can run, yeah, in
multi-tenancy clusters. That's what, I mean, that's what Cursor does. That's
what Linear does and many of our customers. So in multi-tenancy, you share the
compute. We can do this so cheaply, right, because we can share the caching, we
can share all of this infrastructure. It's very easy for us to run this way. So
that's the default mode. The cache is, of course, segregated off in different
ways, but it's also shared in ways where, if you have a big burst of traffic,
right, you get more of the cache than others. So it's a very good way of running
multi-tenancy. The other thing we do for multi-tenancy, to keep it very secure,
is that because all the data at rest is in the bucket, you can pass an
encryption key to turbopuffer that we don't have access to except when we
encrypt and decrypt the objects, and that's audit logged on your side, which is
logically and from a security standpoint equivalent to you having all the data
in your own bucket.
Dmitry Kan [49:42]:
Mm-hmm.
Simon Eskildsen [49:43]:
So this is a very nice primitive that, for example, Linear takes advantage of
because they have full control over their data. They can see when turbopuffer is
accessing it. They can shut it down at any point in time. And they can even pass
that on to their own customers, where turbopuffer encrypts data for Linear's
customers on behalf of the customer, with the customer's key. This is like
really, really, I think, groundbreaking and underrated in this architecture. You
can, of course, do single tenancy with turbopuffer as well, where the compute is
only for you, or you can do BYOC where we run turbopuffer inside of your cloud
in a way that's like very compliant. We can never see customer data, but we find
that the multi-tenancy with the encryption, which can be done per namespace,
satisfies the security requirements of even some of the biggest companies in the
world.
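[Editor's note: one way to get the "customer holds the key" property on object storage is S3's SSE-C, where the key travels with each request and is never stored by the service. This is a sketch of that general mechanism, not a description of how turbopuffer implements its per-namespace encryption.]

```python
import os
import boto3

s3 = boto3.client("s3")
customer_key = os.urandom(32)   # a 256-bit key held by the tenant, not by the service

s3.put_object(
    Bucket="example-bucket",
    Key="tenant-a/namespace-1/segment.bin",
    Body=b"example payload",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,   # S3 encrypts with this key, then discards it
)
obj = s3.get_object(               # reads fail without presenting the same key
    Bucket="example-bucket",
    Key="tenant-a/namespace-1/segment.bin",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,
)
```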
Dmitry Kan [50:29]:
Yeah, that sounds awesome. I also wanted to pick one topic which used to pick up
a lot of flame discussions. I don't know if it still does; I don't see that as
much anymore. What is your recall at n? When I go to the docs of turbopuffer, it
says recall at n is 100%. Recall at n, excuse me, for the vector search part. So
what does that mean?
Simon Eskildsen [50:54]:
Not 100%. We said 90 to 100, right?
Dmitry Kan [50:56]:
No, I think it says, wait, wait, wait, wait, I'll need to... What was the page
where you do that? Oh, here, the limits, I guess.
Simon Eskildsen [51:05]:
Oh, I see observed in production. Yeah, it should say up to 100%. That's a bug
in the docs that I shipped last night. I'm going to fix that after this.
Dmitry Kan [51:13]:
Awesome.
Simon Eskildsen [51:13]:
But what it says in the limits is 90 to 100%. But let's talk about recall. I'd
love to get into recall. So I think recall is incredibly important. You have to
trust your database to get it right in the same way that you have to trust your
database to fsync, and you have to trust your database when we say, hey, we
don't return a success to you unless it's committed to S3. You have to trust
that. Recall is similar, right?
If you are working on search and you're working on connecting data to LLMs, then
you don't want to worry in your evals on whether your vector database is giving
you low recall. It's actually a very sophisticated problem to evaluate whether
this is the cause. So you have to trust your vendor. This is an underrated
problem. And I love that you're asking about it, and very few people ask about
it unless they're quite sophisticated. So let's go into a long answer here for
your audience because I think this is paramount. Most databases that have a
vector index are trained on or not trained on, but they're benchmarked against
for these different ANN open source projects. So there's SIFT and others. The
problem with these data sets is that they do not represent what we see in the
real world. A lot of them are very low dimensionality. Like when we do
benchmarking on a billion that we're working on right now, the biggest data sets
we can find are like 64 dimensions. This is not what people are doing in
production. They're doing at least 512, often generally I'd say the average is
around 768 dimensions. These are not representative data sets. And the
distributions in the academic benchmarks are also completely different, really
different from what we see in real data sets, right? In real data sets, we see
millions of copies of duplicates, right? We see filtering, all these chaotic
environments that do not present themselves in the academic benchmarks. So if
you're using a vector index that's only been tested on academic benchmarks,
it's, I mean, it's like the LLMs, right? You don't really trust it just based on
the scoring. It's all the vibes, right? It's all the qualitative thing, right?
Dmitry Kan [53:21]:
Right.
Simon Eskildsen [53:21]:
Outside of the benchmark, whether it will work for your domain, right? Like with
the LLMs.
Dmitry Kan [53:25]:
That's right.
Simon Eskildsen [53:25]:
Like early on, very, very early on in turbopuffer's history, in the first month,
I was mainly iterating against the SIFT data set, right? Just a 128-dimensional
data set. I didn't know anything about ANN at the time,
just like, okay, this is pretty good. We can tune some specifics on this, and
then I can go wider. But I had a feedback loop. And the observation I had at
the time was that I got something that worked really well, great heuristics, on
SIFT. And then when I went to the other data sets, it just completely did not
work well or generalize to them. And I think
that taught me an early lesson that these academic data sets are just not
enough. And the only way to know what your recall is going to be is to measure
it in production. This is what turbopuffer does. For a percentage of queries, it
depends on the number of queries that you do, but let's say around 1% of
queries, turbopuffer will run an exhaustive search against the ANN index on a
separate worker fleet. We will then emit a metric to Datadog that is the recall
number, right? Like, which is basically, okay, this is the top 10 that we know
is accurate versus the heuristic ANN top 10, and you measure the overlap. And we
will average that over time. I have a graph in Datadog that shows all the
different organizations that have more than 100 queries in the past hour or
whatever. And then we have the recall for all of them. We have the recall at
whatever k they ask for, the @10 recall, the p90 recall, and we try our best to
make sure that this is green at all times. To be considered green, anything
above 90% is generally quite good. Well, 90% is quite good for some queries, but
for simpler queries, often it's closer to 100%. Many of our customers have 99.5%
recall. So this is the only way that we know to do this. And it's funny you
asked this question today because last night I was hacking on putting this into
the dashboard. So literally putting the recall that we observe from this
monitoring system into the dashboard of the user because we think it's that
important and it's very difficult to get right. We have spent thousands of
engineering hours to make sure that the recall is high. Now recall on academic
benchmarks, easy. Recall on raw ANN search, especially on academic benchmarks,
very easy. Raw recall on production data sets, I'd say medium to medium hard.
High recall on ANN queries with filters, with mixed selectivity and incremental
indexing, absolute hard mode. If you just slap a secondary vector index onto an
existing database, this is what they can't do. They can't sustain
like a thousand writes per second with high recall in the face of very difficult
filter queries. So let's talk about filtered recall for a second. There is
barely any academic data sets on this, yet it's all the production workloads.
What a filtered ANN index means is that let's say that, for example, you have an
e-commerce store and you're searching for, I don't know, yellow, right? And you
want to only get things that ship to Canada. That cuts the clusters in different
weird ways that might end up with a selectivity of 50%. And so if you just visit
the closest whatever vectors with some heuristic you have, you're not going to
get the true ANN because you actually have to search maybe twice as many, maybe
three times as many vectors to get the right recall. The query planner, the
thing in the database that decides where to go on disk and figure out the data
and aggregate it all together to return it to the user needs to be aware of the
selectivity of the filter and plan that into the ANN index. Again, if a database
is not really serious about their vector offering, they're not doing this.
They're not measuring it in production. They're not willing to show their users,
and they don't have a full infrastructure in place to measure the recall. So I'd
say we take this extremely seriously, and we don't want our users to have to
guess this. And it's sometimes a thankless job, because many, many, many evals
that we run against some of the other vector indexes show very low recall, and
how are users supposed to know? Because running these tests is extremely
difficult.
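[Editor's note: the production recall check described above boils down to comparing an exhaustive top-k against the ANN top-k and measuring the overlap. A small self-contained sketch; in production the exact side runs on a separate worker fleet over a sample of real queries.]

```python
import numpy as np

def exact_top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> set[int]:
    # Exhaustive search: cosine similarity against every vector, so the result is
    # ground truth rather than a heuristic.
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return set(np.argsort(-sims)[:k].tolist())

def recall_at_k(ann_top_k: list[int], exact: set[int]) -> float:
    return len(set(ann_top_k) & exact) / len(exact)

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 768)).astype(np.float32)
query = rng.normal(size=768).astype(np.float32)
exact = exact_top_k(query, vectors, k=10)

wrong_id = next(i for i in range(len(vectors)) if i not in exact)
ann_result = list(exact)[:9] + [wrong_id]   # pretend the ANN index missed one neighbor
print(recall_at_k(ann_result, exact))       # 0.9
```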
Dmitry Kan [57:59]:
It is, and as you said, you need to have trust there, right? Trust your vendor.
And it's basically, like in some documentation pages you say, the floor, or the
bottom line, right? Beneath which it just doesn't make sense, right? If the
quality isn't there, then why are you even running this? It's the difference
between, you know, finding that product with those constraints when it exists
and actually not finding it, right? And therefore not buying it, and so on and
so forth. It's crucial.
Simon Eskildsen [58:30]:
And I think you can never guarantee a recall. You can observe what you are
trying to make it be on every data set, but if you send a billion completely
random vectors with 3,000 dimensions and then hit them with queries under
10%-selectivity filters, where there is no natural clustering because they're
random vectors, you're not going to get 100% recall. That just completely
breaks every heuristic that's made, right? But all data in production, real data
that people want to search has some natural clustering to it. So that's not a
real benchmark that you can evaluate recall on, right? And so we always take
this seriously, and in POCs and with the monitoring we do, we're looking at
these numbers all the time. But there are like absolute edge cases that can be
very, very difficult. What you also have to do as a database vendor is that it's
a tug of war between "we're going to look at more data to try to get high
recall" and "we're going to try to improve the clustering of the data so that we
have to search less data." And so you're always trying to improve the clustering, and
you're always trying to improve the performance of the database so we can look
at more data to get high recall.
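[Editor's note: a toy version of the filtered-ANN planning idea from a few minutes earlier: if a filter only keeps a fraction of the candidates, the planner has to widen the search proportionally. The scaling rule and cap below are made-up heuristics, purely for illustration.]

```python
def plan_search_breadth(k: int, filter_selectivity: float, max_candidates: int) -> int:
    # Roughly 1/selectivity more candidates must be visited to surface k true
    # neighbors that also pass the filter; cap it so query cost stays bounded.
    assert 0.0 < filter_selectivity <= 1.0
    oversample = 2.0   # a little slack even for unfiltered queries
    return min(int(k * oversample / filter_selectivity), max_candidates)

print(plan_search_breadth(k=10, filter_selectivity=1.0, max_candidates=5_000))   # 20
print(plan_search_breadth(k=10, filter_selectivity=0.5, max_candidates=5_000))   # 40
print(plan_search_breadth(k=10, filter_selectivity=0.01, max_candidates=5_000))  # 2000
```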
Dmitry Kan [59:37]:
Yeah, for sure. And now that you mentioned filtered-search challenges—Big ANN is
another thread. I don't know if you're aware: there's ANN-Benchmarks, right? But
there's also the Big ANN Benchmarks suite that I happen to have had the pleasure
of participating in. One of the tasks they have is the filtered search task. I
have not participated in that one. But again, as you
said, it's kind of like academic, but some of the data sets are quite large, you
know, like billion points, dimensions are not that huge—on the order of a couple
hundred.
Simon Eskildsen [1:00:11]:
That's the thing—they're often only in the 100 to 256 dimension range, not the
512 or 768 you typically see in production.
Dmitry Kan [1:00:13]:
Right. They are real data sets, but they're from the past generation of
vectors—the pre–modern embedding era.
Simon Eskildsen [1:00:35]:
Modern embedding models behave so differently on real workloads. We just don't
see people rely on those older benchmark setups in production.
Dmitry Kan [1:00:35]:
That's right.