Dmitry Kan [0:17]:
Today, as you were preparing your organic, high-mountain Taiwanese oolong in the
kitchenette, one of your lovely co-workers mentioned that they were looking at
adding more Redises because it was maxing out at 10,000 commands per second,
which they were trending aggressively towards. You asked them how they were
using it—were they running some obscure O(n) command? They'd used eBPF probes to
determine that it was all GET and SET. They also confirmed all the values
were about or less than 64 bytes. For those unfamiliar with Redis, it's a
single-threaded in-memory key-value store written in C. Unfazed after this
encounter, you walk to the window; you look out and sip your high-mountain
Taiwanese oolong. As you stare at yet another condominium building being built,
it hits you: 10,000 commands per second. 10,000. Isn't that abysmally low?
Shouldn't something that's fundamentally just doing random memory reads and
writes over an established TCP session be able to do more? Hello there, Vector
Podcast is back, season 4, and we are kicking off with an exciting topic and
guest, Simon Eskildsen, CEO of turbopuffer. I've been watching you guys from,
you know, almost from the start, just following each other on Twitter like
virtual friends. And it's funny that now you're the CEO of the company, and before this episode, you tried to sell turbopuffer to me and said,
"Hey, why don't you use it?"
Simon Eskildsen [2:06]:
It'll all come to pass. Yeah.
Dmitry Kan [2:08]:
Facts for sure. But tell me—hey, welcome. First of all, welcome, and thank you
very much for coming on.
Simon Eskildsen [2:15]:
Thank you.
Dmitry Kan [2:16]:
It's a tradition to usually start with the background. If you could speak in
your own words about yourself and your journey. I know that you've worked at
Shopify at some point, you know, also scaling databases, I guess. Right. But
I've also been following your napkin math newsletter. I've been reading it, and
maybe I'll quote some text from there today just to amuse and excite our
audience. But tell me about yourself.
Simon Eskildsen [2:46]:
Yeah, I can give a very brief overview and if we can dig into anything, if
there's anything that stands out. I started programming when I was a teenager.
Similar to you, English was not my first language, so at some point I exhausted
the Danish web and then like delved into video game addiction for three years as
a teenager to learn enough English to sort of, you know, get my own ChatGPT
moment and take off point. And then I spent a lot of time in high school being
not very good at competitive programming, but good enough to qualify for the
small country of Denmark. And then I spent almost a decade working at Shopify
doing mainly infrastructure work. So when I joined Shopify and the
infrastructure team, we were doing, I mean, it was not even an infrastructure
team, like DevOps was just becoming a thing. And we were driving just a couple
hundred requests per second. And by the time I left, we saw peaks of more than a
million. And I've more or less worked on all of the stateful systems that power
that because they generally tend to be the bottleneck, just playing whack-a-mole
every single year for every Black Friday for many years. And I spent the
majority of those years on one of the last resort pagers for Shopify as well.
Those pager shifts in the middle of the night were very scary, and a lot of
money, of course, runs through Shopify, so there was very high responsibility in that.
I left in 2021 and kind of jumped around my friends' companies, helping them
with various things. And I'd spent almost my entire career at one company. So I
wanted to dabble and just, you know, go and basically help my friends with any
infrastructure challenges they had. And in 2023, when ChatGPT launched and the
APIs launched, I was working with my friends at this company called Readwise.
They have a product similar to Pocket and others for reading articles later, a
phenomenal product. They asked me to build a recommendation feature for
articles. And it's like, well, that's perfect, right? Embedding models are
basically just LLMs with their heads chopped off, trained on exactly this data.
So we built something and it actually worked pretty well for
just recommending articles. But then I ran the back of the envelope math on what
it would cost to do this for the entire article catalog, right? Hundreds of
millions of articles. And it would have cost more than 30 grand a month to do.
And for a large company, that's not a big deal for an experiment, but this was a
company that was spending three grand a month on a Postgres instance that I had
tuned prior to working on this, and spending 10 times that on just recommendations
and possibly search was untenable. So it sort of lost traction, and it was
a bit sad. And it sort of ended up in that bucket that a lot of companies have
of like, okay, we're going to work on this when it becomes cheaper and then
we'll ship this feature. But it was a bit sad because I was excited about this
feature as a user of the product as well. And I could not stop thinking about
that. Why was it so expensive? And the vector databases at the time were storing
everything in memory. And DRAM on a cloud costs somewhere between $2 to $5 per
gigabyte. And this just, the economics of this didn't line up. It wasn't that
this vector database was doing anything, you know, malicious in their pricing.
They're just trying to earn an honest margin on memory pricing, but memory
pricing was just too high and it stopped this feature in its tracks. And what I
couldn't stop thinking about is why can't we do all of this on top of object
storage, right? Like we just put it all on object storage, that's the source of
truth. And then when we actually need some piece of data, we put it in memory or
even on disk if we can. And I did the math on that and I was like, I think
that's about 100 times cheaper. And of course, that would have been a no-brainer
for Readwise. We could have just bought it and started using it and tried it
out, right? And maybe put way more data in and maybe worked our way up to that
30 grand a month bill, but with a different workload. And so, yeah, I couldn't
stop thinking about it and then eventually started writing the first version
over the summer of 2023, just me alone in the woods of Canada and then launched
it in October of 2023, which is probably where you saw it. I didn't really tell
anyone about it. I was just hacking away. Launched it, did a lot of R&D over
that summer, insights that some of them still are in the product and a lot of
them we've since phased out. But the most important thing was that it launched
and the first version of turbopuffer, I was just looking at the website the
other day for an unrelated reason, didn't have mutable indexes, so you just
wrote to it and then you called an index endpoint and then you were locked in,
like that's it. And it didn't have any SDKs, it was just a big, you
know, pure HTML website, but it was enough to ship it and it caught the
attention at the time of the Cursor team back in 2023. And of course, this was
early on for Cursor, it was early on for us, and their vector database build did
not line up with their per-user economics and how they wanted to use RAG in
Cursor. And so they wanted to try to work together, and we exchanged a bunch of
emails of bullet points, and it was very clear that they thought that this
architecture was exactly right. Knowing the team now, they would have just sat
down at the dining table, done the napkin math over there and then thought,
why hasn't anyone built it like this? And so I went to San Francisco
and spent some time with them and came up with a bunch of features that they
would need and called the best engineer that I knew at Shopify, my co-founder
Justine, and asked if they'd come on board because I think maybe there's
something here. And yeah, we launched it. Cursor moved over and their bill was
reduced by 95%, and of course the traditional storage architecture they were on
before didn't make sense for the Cursor economics, but our storage architecture
really did because you put all the codebase embeddings on S3 and then the ones
that are actively being used, we can use in RAM or have in disk. I'll stop
there, but that would be what led up to this moment.
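[Editor's note: the "about 100 times cheaper" intuition can be checked directly from the per-gigabyte figures mentioned here; a tiny sketch using those numbers.]

```python
# The DRAM-vs-object-storage gap Simon describes, per gigabyte-month.
dram_per_gb = (2.0, 5.0)   # "$2 to $5 per gigabyte" of DRAM on a cloud
s3_per_gb = 0.02           # ~2 cents per gigabyte on object storage
print([d / s3_per_gb for d in dram_per_gb])   # [100.0, 250.0] -> roughly 100x or more
```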
Dmitry Kan [9:22]:
Oh, that's an amazing journey. A lot to ask, of course, a lot of questions, but
just on that Cursor thing, as I told you before we started recording, you know,
and then I've listened to the Lex Fridman podcast episode with the Cursor team
and they did mention turbopuffer sort of like in passing, but you know, I think
that also probably created a lot of attention to you guys. But I'm just curious,
like how did you get together? How did you know the Cursor team? Did you
somehow know someone on the Cursor team that you could partner with early on?
Essentially, you know, they kind of helped you to pioneer it, right? In some
sense, becoming the first client or maybe a future client, right? How did you
approach them?
Simon Eskildsen [10:13]:
They did. I mean, they were a design partner in every sense of the word, right?
We had a Slack channel, and I feel like they treated us as part of their team,
and we treated them as part of our team. They came inbound. They sent an email
based on the website, and they said, "Hey, we would need mutable indexes and
glob and a couple of other things." So it's like, well, those are very
reasonable requests, right? And I think they had the conviction that this was
the right architecture. I guess if we could prove worthy of that trust, then you
would be in a good place. So it was really just an honest conversation, just the way that
the website is today, a very honest description of what are the trade-offs, what
can it do, what can it not do, what is the latency profile, what are the
guarantees, and that's exactly the kind of bullet point discussion that we
engaged in over email before I met the team in person. Yeah, and they, of
course, were a small team at the time, right? And they needed help with parts of
their infrastructure, and with working very, very closely with teams that they
could trust, with the right economics and the right reliability.
Dmitry Kan [11:24]:
Yeah, for sure. But I guess that honesty, which I also value a lot, you know, in
my work as I became a product manager, you know, three years ago, and I think it
applies to any discipline: be honest. But you know, that honesty probably rests
on the fact that you'd done your napkin math and you knew where this would
scale, how this could go, right? How did you go about doing that pre-launch,
right, before having any client? Was it at your friends' company that you
figured out the economics and sort of the throughput and all of these rigorous
questions that you ask, you know, as problem statements in napkin math?
Simon Eskildsen [12:04]:
I think I should almost bring up the Internet Archive version of it. The
first version of turbopuffer, I had not thought about the business at all. I
didn't have any launch playbook. I had run, of course, all the economics of what
it would cost me to operate and spent a decent amount of time on the pricing
because that felt like an important thing to spend time on at the time. But
there was really not much more than that. Of course, the Readwise team was very
interested, but at the time I could barely do it, you know, I could just do
around 10 million vectors, which is not enough for their use case. I can screen
share the website with you right here, showing what it looked like at the time,
and then, for the listening audience, we can get your reaction. But it was
very simple. I wouldn't put any sophistication in it. It was honestly, I was
exhausted. I'd been working on this completely alone, not telling anyone about
it, with no interested customers, for like four months, extremely focused, like
every single day. And I couldn't, like if you ask my wife, she'd say I was very
distracted and she's just like, "Why are you working so hard on this? Like
there's no one on your team, you don't have any customer lineup." And I'm just
like, "Someone has to do this." And I just launched it. I mean, now it feels
embarrassing when we did launch it, just couldn't do that much. It was pretty
slow. I spent a bunch of time actually trying to make it work in Wasm and on the
edge, but it was too hard to make it fast, and a bunch of other false starts
like that on different types of ANN indexing structures we could talk about that
as well and what we settled on. But there was no real sophistication in the
go-to-market. It was really just here it is, here's the napkin math, here's what
it does, let's see how the world takes it.
Dmitry Kan [13:53]:
Yeah. But I see, I think when you sit on that, well, you didn't sit on it yet, but you
had a cool technology idea in mind, right? You knew, you know, it may play out,
but also of course it required a lot of hard work, like you said. But after
that, after you see it fly like on some small scale or whatever scale, I think
that brings you like that excitement to bring it to the world, right? So yeah, I
see you sharing the screen of the web archive page.
Simon Eskildsen [14:32]:
Yeah, that's it. Very simple.
Dmitry Kan [14:35]:
Yeah, that's awesome. But yeah, that's actually a good segue to... You know, you
probably know I was there at the emergence of the vector database field. I
think I was probably the first to write just a simple blog post with like, you
know, these short snippets of what each vector database did and how they stood
out and so on. turbopuffer wasn't there because turbopuffer was
still in your mind, I think. But the segue here is I don't have it covered in
that blog post, but in your mind, why were you not happy with the vector
databases like at large? Did you try all of them? Did you try some of them? Why
did you think that a new vector database deserves to exist?
Simon Eskildsen [15:28]:
Yeah, I think it really just came back to the Readwise example, right? They
looked like great products. I really liked the API of many of them. They had
lots of features that it would take me a long time to build, even features that
we don't have today, although we have a lot of features now compared to when we
launched. It really came out of the cost piece: it felt that there was a lot of
latent demand built up in the market of people who wanted to use these things,
but it just didn't make sense with the economics. It's very difficult to
earn a return on search. I mean, I remember the search clusters at Shopify were
very expensive, but e-commerce is a lot about search, so it was okay, right? But
for a lot of companies, search is an important feature, but it's not the
feature, right? And so the per-user economics just have to make sense. It's not
that everyone just wants it in the cheapest possible way, it's that if you
invest in infrastructure, you have to get a return on that investment. And it
felt that I knew that at Readwise they could get a return on that investment,
but it wasn't at 30 grand a month; it was maybe closer to three grand or five
grand a month that they would feel that they could earn a return on that feature
in terms of conversion, engagement, and whatever. So it was really about the
storage architecture. And I think that when I think about databases now, this
was not as coherent to me at the time. At the time, I was driven by the napkin
math, not the market, nothing else. It was based on one qualitative experience
and the napkin math. There was nothing else in it. I can speak about it in a more
sophisticated way now, having learned a lot about go-to-market since, but that's
really all it was at the time. It was an insight on those two things. The best
ideas, right, are simultaneous inventions, right? Someone else would have done
it six months later; probably other people were working on it at the time who
launched later, right? We were the first to launch with this particular
architecture, but it was out there for the grabbing, right?
Dmitry Kan [17:28]:
Yeah.
Simon Eskildsen [17:28]:
The idea was in the air, like S3 had the APIs now finally. So the way that I
think about this, to really boil this down, is that if you want to create a
generational database company, I think you need two things. You need a new
workload. The new workload here is that almost every company on earth sits on a
treasure trove of data and they want to connect that to LLMs, especially all the
unstructured data, which has always been very difficult to do.
We did this for structured data in the 2010s. The new workload was that we
wanted to do analytics on billions, tens of billions, trillions of rows of
structured data. But now with LLMs, we're entering into that with the
unstructured data. That's the first thing. We need a new workload because that's
when people go out shopping for a new database. The second thing that you need
is a new storage architecture. If you don't have a new storage architecture that
is fundamentally a better trade-off for the particular workload, then the
natural move is to tack a secondary index onto your relational database, OLAP
stack, or existing search engine. I would have made that decision in the shoes
at Shopify, right? It's like, well, this database has a really good vector
index, but it doesn't bring anything new in terms of the storage architecture,
so we're just going to invest in the MySQL extension, right? That's just what we
would have done at Shopify—same thought process, right?
Dmitry Kan [19:01]:
Mm-hmm.
Simon Eskildsen [19:01]:
These are great databases. They've stood the test of time, and when you're on
call, you become very conservative in what you adopt for new workloads. But you
cannot ignore a new storage architecture that is an order of magnitude cheaper
than the previous one. When you store a gigabyte of data in a traditional
storage engine, you have to replicate that to three disks, maybe two if you have
a little more risk tolerance, but likely three. A gigabyte
of disk from the cloud vendors costs about 10 cents. You run it at 50%
utilization, otherwise it's too scary to be on call, 20 cents per gigabyte.
Times three for all the replicas, 60 cents per gigabyte. Object storage is two
cents per gigabyte, right?
Dmitry Kan [19:43]:
Yeah.
Simon Eskildsen [19:43]:
It's 30 times cheaper. If it's all cold, now by the time you have some of it in
SSD and you have it in memory, then the blended cost ends up being different,
but it tracks the actual value to the customer. Even if you have all of that in
disk, well, you only need one copy, right? And that disk you can run at 100%
utilization, meaning the blended cost is now 12 cents per gigabyte, right? So
the 10 cents, 100% utilization plus the two cents per gigabyte for object
storage. So now you have the ingredients of a new actual database. You have a
new workload, right? Which means that people are out there trying to look for
ways to connect their data to LLMs, and then you have the second ingredient,
which is a new storage architecture that allows them to do it an order of
magnitude easier and cheaper than what they can do on their existing
architectures. And this matters because vectors are so big, right?
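[Editor's note: the storage-cost arithmetic from this exchange, written out. The figures are the ones quoted in the conversation, rounded; real cloud pricing varies.]

```python
disk_per_gb = 0.10      # ~10 cents/GB-month for a cloud block-storage disk
replicas = 3            # replicate to three disks for durability
utilization = 0.5       # run disks half full so on-call stays sane

traditional = disk_per_gb / utilization * replicas   # 0.60 -> 60 cents per GB
object_storage = 0.02                                # 2 cents per GB on S3/GCS
print(traditional / object_storage)                  # 30x cheaper when fully cold

# Blended: one disk copy run at 100% utilization as cache, plus the S3 copy.
blended = disk_per_gb / 1.0 + object_storage         # 0.12 -> 12 cents per GB
```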
Dmitry Kan [20:34]:
Yep.
Simon Eskildsen [20:34]:
A kilobyte of text easily turns into tens of kilobytes of vector data.
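[Editor's note: a worked example of why a kilobyte of text balloons into tens of kilobytes of vectors. The chunking and embedding dimensions are assumed, typical values.]

```python
text_bytes = 1_024
chunks = 4                          # overlapping chunks of a few hundred characters each
dims, bytes_per_float = 1_536, 4    # e.g. a 1536-dim float32 embedding per chunk

vector_bytes = chunks * dims * bytes_per_float
print(vector_bytes, vector_bytes / text_bytes)   # 24576 bytes, ~24x the source text
```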
Dmitry Kan [20:38]:
Yeah, yeah, that's absolutely true. One other thing that I kept hearing about,
you know, on whether or not to introduce vector search into the mix for some
really heavy workloads, is that it will bring certain latency on top
that we cannot tolerate, right? For example, if you run a hybrid search like you
guys have implemented as well, you know, one of these will be the slowest and
therefore you will have to wait for that slowest component. And so if it adds, I
don't know, a few hundred milliseconds on top of your original, you know,
retrieval mechanism, then it's going to be a non-starter. What's your take on
that? Have you thought, obviously you have thought about that. What's the edge
that turbopuffer brings in this space over maybe pure databases?
Simon Eskildsen [21:30]:
Yeah, I think there's two types of ways that people adopt vector databases or
turbopuffer. We don't consider turbopuffer a pure play vector database. We
consider it a search engine. We actually consider it a full database because
there's a full generic LSM underneath all of that. And we consider that the
actual asset of turbopuffer is an LSM that's object-storage native and doesn't
rely on any local state. We just think that the vector index and the search engine
index is what the market needed the most. So let's speak about latency. There is
no real fundamental latency trade-off with this architecture. The only thing is
that once in a while you will hit that cold query, but the entire database is
optimized around minimizing the amount of round trips that you do to S3. S3, you
can max out a network card, right? So you can get on a GCP or your AWS function,
get 50 to 100 gigabits per second of network bandwidth—not gigabytes per second
of network bandwidth. So this is similar to disk bandwidth, but the latency is
actually even better in the clouds often than disks, even with SSDs, even than
NVMe SSDs. So the network is phenomenal. You can drive, say, you can drive all
of that data, you can drive gigabytes of data per second in a single RAM strip.
So you can get greater throughput, but the latency is high. The p90 might be
around 200 milliseconds to S3 for every round trip, somewhat regardless of how
much data that you transfer, assuming you're saturating the box. We've designed
almost everything in turbopuffer around minimizing the number of round trips to
three to four. That doesn't just help for S3, it also helps for modern disks,
where it's the same thing: you can drive enormous amounts of bandwidth, but the
round-trip time is long, right? It's like hundreds of microseconds instead of
hundreds of milliseconds, but still substantial compared to
DRAM. The latency trade-off is not a fundamental trade-off with this
architecture. By the time that it makes it into the memory cache, it's just as
fast as everyone else. We have found that people don't care if it's like a
millisecond or five milliseconds. As long as it's reliably less than around 50
milliseconds, they're good, right? And I think that a lot of the traditional
storage architectures, especially because of the sharding structure with
multiple nodes, you're already in a worse position than going to two systems:
if you run a query on some of the traditional search engines, you generally
touch five, ten, maybe more nodes, because the shard size is very, very small.
We could go into more depth on that, but you already have this fan-out problem.
What we see is that there's two types of ways that people adopt
it. So the first one is you have an existing lexical search engine. You are
having a hard time running it because of this traditional, like very stateful
architecture, and they're reputed for just being difficult to run. And you're
like already a little bit at your threshold for the amount of money that you're
spending on this cluster. And if you put the vector data in, it's often 10 to 20
times larger than the text data. It's just a project that stops in its tracks,
similar to the Readwise case that I mentioned before. So for those
players, we often see that they have something that's really well-tuned for the
lexical and they adopt a vector store, and then they do two queries in parallel.
The vector store should not be slower than the lexical, right? So these are just
two futures that you merge together in userland. And in general, we see that our
customers are actually quite happy to move some of the ranking and the final,
like second stage ranking out of the search engine and into a search.py instead
of a big search.json, which can be very difficult to maintain. Many of these
companies express a lot of desire to move more and more of their lexical work
also onto turbopuffer, and we have a full-text search engine. We don't have
every feature of Lucene yet, but we're working very, very actively on bringing
this up. What we also see is that a lot of our customers don't need all of the
features of Lucene anymore because the vectors are so good that a lot of the,
you know, PhD-level efforts we did before to turn strings into things are not as
much of an issue anymore. And really what we use strings for now is that when
you search for "Dmitry Kan," you get "Dmitry Kan," right? Like for a prefix
match, whereas an embedding model might think you're talking about something
else entirely. Those kinds of things are important, and we still need string
matching for that. Lots of applications need it, but there's a lot of things
that we do in Lucene with synonyms, with stemming, with all these kinds of
things that the models are frankly just a lot better at. So we find that this is
an adoption curve that is there. A lot of the newer companies just start with
embedding models and simple full-text search, and they get it up and running on
turbopuffer, and they like that. They just pay for what they need, they don't
think about it, and they could pump a petabyte of data in if they wanted, and it
would be extremely competitive on pricing, and they don't have to think about
it.
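[Editor's note: a minimal sketch of the "two queries in parallel, merged in userland" pattern described above, the kind of thing that might live in a search.py. The lexical_search and vector_search helpers are hypothetical placeholders, not a real turbopuffer or Lucene client, and reciprocal rank fusion is just one common way to merge the two rankings.]

```python
import asyncio

async def lexical_search(query: str, k: int) -> list[str]:
    # Placeholder: call the existing lexical engine, return doc ids ranked by BM25.
    return [f"lex-{i}" for i in range(k)]

async def vector_search(query: str, k: int) -> list[str]:
    # Placeholder: embed the query and ask the vector store for the k nearest ids.
    return [f"vec-{i}" for i in range(k)]

def rrf_merge(*rankings: list[str], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: a simple, tuning-free way to merge two rankings.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

async def hybrid_search(query: str, k: int = 10) -> list[str]:
    # Two futures resolved in parallel; neither query waits on the other.
    lexical, vector = await asyncio.gather(
        lexical_search(query, k), vector_search(query, k)
    )
    return rrf_merge(lexical, vector)[:k]

print(asyncio.run(hybrid_search("dmitry kan")))
```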
Dmitry Kan [26:26]:
Oh, that's awesome. That's awesome. Actually, I forgot to mention, I forgot to
ask you, which language did you choose to implement turbopuffer?
Simon Eskildsen [26:34]:
Yeah, we... Well, it was just me at the time, but I chose Rust.
Dmitry Kan [26:40]:
Mm.
Simon Eskildsen [26:40]:
And I think I'd spent the majority of my career writing Ruby at Shopify and then
a lot of Go as well for some of the infrastructure components. And then mainly
debugging and reading C, which is what all the databases that we were using were
written in. I really like Go. Go sat alongside Ruby at Shopify because Go was
one of those things where, when leading teams, I didn't have to worry about
whether someone knew Go or not, because the ramp-up to learn it is two weeks.
The ramp-up to learn Rust and be proficient in it is months, right? And
someone that's written Rust for two years is a lot more productive than someone
who's written it for two months in the language. And that's just not the case
for Go. Like someone who's spent two years in it is just not that much more
productive. I think that's an amazing feature of the language. From my own point
of view and from the napkin math point of view, having been inside of runtimes,
the Ruby MRI runtime and the Go runtime, I was just hungry to get directly
connected to the metal of the machine. And for a database in particular, that
was very important, right? We need to vectorize everything. We need full control
over that. And as remarkable as Go is, and I think it would have been okay, that
raw access to the machine has been needed for writing something like turbopuffer.
Dmitry Kan [28:11]:
Yeah, yeah, for sure. I still remember the times when I was learning and coding
industrially in C and C++ like you. You really needed to be very, very careful,
but in return you could get a lot of performance gains, you know, and some of
your ideas really fly. But yeah, today I guess I'm coding more
in Python or should I even say that I code in Python when I use Cursor more and
more, which is, by the way, scary, you know, the feeling when some other entity
writes code and you are just reading it, right? It's a little bit scary and I'm
still grappling with it, but the amount of productivity that I get is enormous
and it's like, you know, I can ship daily features and just see them being used.
That's amazing.
Simon Eskildsen [29:02]:
I think what I love about it is that I still love to sit there and write the
artisanal code by hand. You know, maybe at some point we will mark turbopuffer
as an artisanally written database because we don't use a ton of AI for the very
key parts because, I mean, we're at the edge of what the LLMs could know. But I
think that for me, in a position where I'm in and out of meetings all day these
days, I can actually get a lot done in a 30-minute window when I have something
that's prompting and writing the tests, right? You kick something off at the
beginning of a meeting, you check in on it in the, you know, 15, 30 minutes you
have in between blocks, and this allows me to actually contribute a lot more
code than I was otherwise going to be able to. Not into the core engine; you
know, I don't get let into a lot of that anymore because I don't have the time
and focus that it takes to fully think something through there. But for
the website, the API, tangential features, all of that, it's just been
wonderful.
Dmitry Kan [30:01]:
Yeah, that's amazing. I also wanted to go a bit on a tangent. You're
essentially, you could say, a mathematician engineer, but you took a leap
towards becoming a CEO, right? And I think, you know, as you said, you go to
meetings, you do lots of, you know, probably sales and product and all of that
stuff. Was it a natural transition for you? What have you learned in this
journey, and what maybe do you miss from your previous career when you were,
you know, hands-on and sat down and wrote a bunch of code?
Simon Eskildsen [30:43]:
I think I have a couple of angles to answer the question, but not necessarily a
direct answer. I think one angle is that fundamentally I'm like a growth junkie
for better or worse. And I think that entrepreneurship is the ultimate path for
a growth junkie. It was never really something that I assumed that I was going
to do. Even when I was working on the project, it was never about becoming a
founder; it was just about creating the database, right? And at
some point, becoming the founder of the company becomes a means to an end of
creating the database and getting it into the hands of our users and making sure
they have a great time. That's always what drove me, right? It was: Readwise
should have this, right? Our customers should have this. They should have a
great experience. And that's always what's driven me. And to me, the founder and
all of the other things have been a means towards an end there. I think that one
of the things that is maybe both controversial but also feels like a true
statement is that at some point I became a bit numb to what work I enjoy and
what I don't enjoy anymore. Because what I enjoy the most is making this company
successful and making the database successful for our customers. That's what I
care the most about. And I'm, yeah, I honestly, I love sales. I love marketing.
I love the engineering. I love hiring people for the team. I love all of these
things. But it's not a simplistic answer to, oh, I've been coding my whole life.
I think it's more that that is my idle activity. If there is a one-to-two-hour
window and there's nothing urgent on, then I'm going to go spend some time in the code
base. It's like, oh, how did Nathan implement this new query-planning heuristic?
That's my idle activity, and when interviewing people I also always try to
understand, especially if they're in a more hybrid role, what's your idle
activity? What's the thing that you do when you have one to two hours and
nothing else comes up? Do you gravitate towards the code? Do you start looking
at, do you start writing an article? Do you start playing with the product? What
is that idle activity? And it is code for me. That's what everything is grounded
in. And I think it has a deep influence on how I can lead the company. I often
think about something that Taleb said, you know, the author of "Antifragile"
and a bunch of other books: the best authors of books are not the ones that sit
down and, like, you know, read a bunch of papers, then write a page, then read
another paper, write a
page. The best books are written by people who just, you know, go to a cabin and
sit down, write 500 pages and hit publish. Of course, that's not what actually
happens. But if you read the books, it's probably pretty close to what actually
happened. And he just has the citations in his head. And I think about that
often when building this company, that it has felt like I've worked for this my
whole life without knowing it. And I feel every morning that I wake up that this
is exactly what it has led up to. So it's very natural, even if it wasn't a goal
unto itself, that it makes sense with the experience I've had to do exactly
this. And I tremendously enjoy it, but it's not a simplistic answer to do I miss
coding.
Dmitry Kan [33:52]:
No, no.
Simon Eskildsen [33:52]:
I want to make this company incredibly successful, but sometimes I will do it as
a recreational activity.
Dmitry Kan [33:59]:
Yeah, I mean, definitely like when I look at you, like on Twitter, for example,
you come across as a very technical person and you are for sure, right? Even
though in order to grow your business, you need to do a lot of other activities.
But at the same time, I mean, yeah, I don't mean to ask it in a way that, hey,
you regret now that you do sales, you regret not doing more coding, which is not
true. You still do that. And I think that all of the engineers will become
better engineers if they learn the mastery of actually presenting what they do,
right? And then they will not need a middle layer or someone else who will go
and talk to that product manager or whoever else they need to talk to, right? So
they can actually represent themselves. But I also loved how you put it really
eloquently: what is your idle activity, right? What's your affinity, what do you
gravitate to? And it resonates a lot with me, because my idle activity, when I'm
really nervous that I'm doing nothing, especially on vacations, is that I start
coding, you know, I just go, okay, let's just hypothesize about something. But
let's dial back to the architecture. Like when
I look at the architecture page of turbopuffer, it's very simple. It's like
client connecting over, you know, TCP to a database instance and it has just two
components there, memory or SSD cache and the object storage. Tell me a bit
more, so I think our listeners and I mostly know what object storage is, but
tell me a bit more about that memory component, like what algorithm design went
into that, maybe trade-offs and, you know, how frequently you need to do the
round trips to the object storage versus when you actually don't do that.
Simon Eskildsen [35:51]:
Yeah, I think it would be easiest to do this by speaking about the lifetime of a
request as the cache warms up. So we actually start with the write path. And
when you do a write into turbopuffer, it's as simple as you can imagine it. I
mean, at this point, we've optimized parts of it so that it's not quite this
simple anymore, but this is the best way to explain it. When you do a write to turbopuffer, that
write basically goes into a file in a directory called the write-ahead log. So
when you write to a namespace, you can imagine that on S3, it's like slash
namespace, slash, you know, write-ahead log. The write-ahead log is basically
just a sequence of all the writes in order, the raw writes. So you do your
write, and it might be, okay, I'm inserting a document with text "Dmitry Kan"
and one with text "Simon," and those are the two documents. In the simplest way,
you can imagine that this file is called 0.json and the next one is called
1.json, 2.json, and so on. That's a database, right? That's just a write-ahead log. And if
you want to satisfy a query, you just scan through all the JSON documents and
you satisfy the query. That's actually a respectable database, and it's not even
that far from the first version of turbopuffer, but of course you have to index
that data as well. So basically, as you can imagine, once many megabytes of data
come in, asynchronously an indexing node will pick it up and put it into the
inverted index for full-text search, into a filtering index for other attributes,
and there will be other indexing types in the future. When that happens, it will
put it into slash namespace slash index and just start putting files in there,
right? And then the query layer can then consult those files, right? Instead of
scanning through every single document to find "Dmitry Kan," you can just plop
in and look at "Dmitry Kan" in the inverted index, find the document, and return
it. That's how a write works. When a write happens, it will go through one of the
query nodes, and the write will also be written into the cache, right? So both
the memory cache and the disk cache. So when you do a query, you will go to that
same query node, right? There's consistent hashing, so if there are three nodes,
the same namespace will end up on node one all the time if it hashes to that
node. When you do a query, it will first check the caches. If you just did the
write, well, it's already there, because we just wrote all the writes into the
cache to have this, you know, write-through cache, and we will satisfy the query
mainly from the cache. If for whatever reason this namespace is not cached,
maybe you did the write a month ago and
so it's falling out of cache and you do the read, well then we'll read through
cache by going directly to object storage with as few round trips as possible to
get the data to satisfy the query, both from the index and from the WAL. We'll
do range reads directly on S3, right? The old like HTTP range header to get
exactly the bytes we need to satisfy the query and then start hydrating the
cache on the query node so that subsequent queries get faster and faster. And we
can do that at gigabytes per second. We can hydrate the cache even for very,
very, very large namespaces. So that's the general architecture of turbopuffer.
On a completely cold query, it takes hundreds of milliseconds, and on a warm
query, it can take as little as 10 milliseconds to satisfy the query. The last
detail I'll point out, and then we can go into a particular aspect of this, is
that turbopuffer has chosen to do consistent reads by default. This is an
unusual choice for search engines. Lucene doesn't do this unless you turn it on
explicitly. I think they've done more work now for real-time indexing, which to
me is the gold standard, which is why I keep referring back to it. It's a
phenomenal piece of software. And turbopuffer has consistent reads by default,
meaning that if you do a write and then you read immediately afterwards, that
write will be visible. And in order to satisfy that, we can't just rely on the
cache on that node. That node could have died, it could have, you know, the
hashing could have moved because we scaled up. So every single query, we go to
object storage and see what is the latest entry in the WAL and do we have that
entry, right? Is it at 3.json or is it 5.json and do I have that? So we have a
little pointer file that we can look, we can download and look at, right? And
that round trip is basically our p50, like our spans are basically, you know,
often like one to two milliseconds of actual search and then on GCS, depending
on the region, 12 to 16 milliseconds waiting for that consistency check against
object storage. The small object latency is a little bit better, so it's eight
milliseconds. But you can turn this off and get eventual consistency, which is
very normal for these databases, like it could be up to one minute out of date,
and then you can often see a millisecond or less of latency observable from
turbopuffer by turning off that check. But we find that this is a very safe
default, and I think that
databases should ship with very safe and unsurprising defaults.
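[Editor's note: a toy sketch of the object-storage write-ahead log and consistency check described above: numbered JSON files under a namespace prefix, plus one small pointer read per query. The bucket name, key layout, and pointer file are illustrative assumptions, not turbopuffer's actual format.]

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-puffer-bucket"   # hypothetical bucket

def append_wal(namespace: str, docs: list[dict], seq: int) -> None:
    # 0.json, 1.json, 2.json, ... -- the raw writes, in order.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{namespace}/wal/{seq}.json",
        Body=json.dumps(docs).encode(),
    )
    # A tiny pointer object recording the latest committed sequence number.
    s3.put_object(Bucket=BUCKET, Key=f"{namespace}/wal/HEAD", Body=str(seq).encode())

def consistent_query(namespace: str, cache: dict[int, list[dict]]) -> list[dict]:
    # Consistent read: one small round trip to learn how far the WAL has advanced,
    # then fetch only the entries the local cache is missing (range reads against
    # index files in the real system; whole WAL objects here for brevity).
    head_obj = s3.get_object(Bucket=BUCKET, Key=f"{namespace}/wal/HEAD")
    head = int(head_obj["Body"].read())
    for seq in range(head + 1):
        if seq not in cache:
            body = s3.get_object(Bucket=BUCKET, Key=f"{namespace}/wal/{seq}.json")["Body"]
            cache[seq] = json.loads(body.read())
    return [doc for entry in cache.values() for doc in entry]
```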
Dmitry Kan [41:00]:
Yeah, for sure, for sure. So in that cache, and let's focus only on the vector
search part for now, you also have the ANN index. Is that also stored on S3, and
do you also keep kind of like a replica of it in memory for quick access? And if
that's true, how do you sort of synchronize the two?
Simon Eskildsen [41:26]:
Both the write-ahead log and the index are, everything is stored on S3. If you
killed all of the compute nodes of turbopuffer in all of our clusters, we would
not lose any data. There is no data on the compute nodes that matter. It's only
transient caching. But we cache everything. Yeah, if you're accessing the index,
we'll cache the index. If you're just accessing the write-ahead log files,
because the namespace is so small or there are parts of the data that haven't
been indexed yet, then that's also on S3 and goes into the same cache with everything else, right?
Prioritized by the workload to try to get the best performance possible.
Dmitry Kan [42:01]:
Yeah, it's quite smart. So effectively... I remember at some previous companies,
when I was running Apache Solr, one of the problems was
always that all of these shards are super cold because they're never used,
right? We still pay for them. But then when the query hits, you incur so much
latency that it's super painful. And so I was always coming up with these ideas,
what if I run some, you know, post-indexing warm-up script that will go and
shoot a bunch of queries to all of the shards just to keep them, you know, up
and running and warm or just cat all the indices on Linux into memory? We've
done that too. That was like 10 years ago, so that was a very strange feeling,
like why do I need to mess with that level of detail? It never actually paid
off. I think what pays off is the smartest way to organize your index and how
you read data backwards. Like essentially when your users really only need fresh
data first, like on Twitter, for example, everyone is really after the recent
tweets and not some archive. And that was a very similar case for us. But
it's very interesting, like you go into so much detail there to make the
database effectively like a living organism, you know, adjusting to the usage.
But you also have multi-tenancy, right? So meaning that the same turbopuffer
deployed across the data centers is going to be used by multiple companies at
the same time unless they demand isolation. How do you think about that when
they use the same, effectively the same instance, compute and index?
Simon Eskildsen [43:50]:
I'd love to go into the Solr example for just one second before we go into
multi-tenancy. How slow were those queries? Because when you say cold, you mean
that it's not in memory. When I say cold, I mean that it's on S3. What kind of
latency were you seeing?
Dmitry Kan [44:04]:
It was very slow. First of all, it also has to do with the domain specificity,
you know, the queries with Boolean clauses that were very long. And so just the
query itself would take a minute to execute on our original index design, and
that was just super crazy, right? But it was also very accurate, because it was
sentence-level search. And then I had to design a new system, a new
architecture, where we could retain the accuracy of that engine but not have to
spend so much money on indexing individual sentences, so we indexed one complete
document, right? I had to change the algorithm slightly, and so it went to
sub-second. It was still, I think, slow, right? But it was much faster, and we
could scale the company effectively after that, right? We went from one minute,
and 75% of infrastructure costs were, you know, shaved off. But that was part of
the Lucene work, you know, munging with the algorithm and changing how it scans
the document. It had nothing to do with the level that you go into, you know,
with turbopuffer, like effectively controlling the whole process there.
Simon Eskildsen [45:28]:
Got it. Yeah, I think the point there is that we do see some customers who are
concerned about the disk cache because they've gotten bitten before. The way
that I would think about it is that in some of the traditional engines, the way
that they do IO, if something is on disk, it feels like it's bad. Like if it's
on disk, it's slow, and it really has to be in memory. And so you sort of have,
you know, the pufferfish: when it's fully inflated, it's in DRAM, right? When
it's deflated, it's on S3. Whereas the traditional engines only had two
settings, right? Either it's on disk, which is quite slow. And frankly, in some
of the traditional storage engines, I've seen the latency on disk being similar
to our latency on S3.
Dmitry Kan [46:10]:
Yeah.
Simon Eskildsen [46:11]:
And so then you have to load it into DRAM. And what a lot of these traditional
databases, they have to do a full copy into DRAM. They can't just like zero copy
off of disk. And then the disks are also quite slow, these old network disks,
right? The NVMe disks are so fast, right? They can drive bandwidth that's
within, you know, a very low multiple of DRAM, right? Tens of gigabytes per
second. But even though the hardware is cheap, you still can't take advantage of
these very easily. You can't just put some software on it and have it be like 10
times faster than an older disk, even if it's fundamentally capable of it,
because what we found, for example, is that we had to remove the Linux page
cache, because the Linux page cache cannot keep up with these disks. So you have
to do direct IO, but when you do direct IO, you don't get coalescing, you don't
get all these other things. So now you have to write your own IO driver, right?
And so databases just have not been built to take advantage of it, because
they're also not built to drive a high IO depth, basically how many outstanding
IO requests they can have in flight. There's much more throughput to be had. So
there's just a lot of barriers to entry there. So what we find is that
when, again speaking in generic terms here of, you know, millions of vectors
queried at once, when something is on disk, it's maybe high tens of
milliseconds, you know, 50 to 70 milliseconds when it's fully on disk, maybe
lower depending on the query, the machine, or whatever. And when it's in memory,
it's closer to 10 to 20 milliseconds, right? This is not bad. Like the user is
barely going to notice it. But of course you're going to get more throughput
that way. And then when it's on S3, it's maybe more like five to six hundred
milliseconds, so users usually notice. But a
lot of our customers, like Notion, for example, when you open the Q&A dialogue
and these different dialogues that will query turbopuffer, they will send a
request to tell turbopuffer, "Hey, can you start warming up the cache here in a
way that makes sense?" And by cache, we just mean putting it into disk and
starting with sort of the upper layers of the ANN index and other things to
reduce the time as much as possible. So there's a lot of things that can be done
here that are very, very simple. Together, that means there's barely a
trade-off.
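[Editor's note: the page-cache point above in miniature. This Linux-only snippet shows what direct IO looks like and why it pushes the buffering burden onto the database; the file path is just an example.]

```python
import os, mmap

# O_DIRECT bypasses the kernel page cache, but it requires block-aligned offsets,
# lengths, and buffers -- which is why you end up writing your own IO layer to get
# coalescing and deep queues back.
fd = os.open("/var/cache/example/segment.bin", os.O_RDONLY | os.O_DIRECT)
buf = mmap.mmap(-1, 1 << 20)    # anonymous mmap is page-aligned, as O_DIRECT requires
n = os.preadv(fd, [buf], 0)     # read 1 MiB from offset 0, skipping the page cache
os.close(fd)
```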
Dmitry Kan [48:26]:
Yeah.
Simon Eskildsen [48:26]:
Let's go back into multi-tenancy, unless you had a follow-up on this.
Dmitry Kan [48:30]:
Yes, let's do that. Like, how do you view multi-tenancy?
Simon Eskildsen [48:34]:
So turbopuffer can run in three different ways. It can run, yeah, in
multi-tenancy clusters. That's what, I mean, that's what Cursor does. That's
what Linear does and many of our customers. So in multi-tenancy, you share the
compute. We can do this so cheaply, right, because we can share the caching, we
can share all of this infrastructure. It's very easy for us to run this way. So
that's the default mode. The cache is, of course, segregated off in different
ways, but it's also shared in ways where, if you have a big burst of traffic,
right, you get more of the cache than others. So it's a very good way of running
multi-tenancy. The other thing we do for multi-tenancy, to keep it very secure,
is that because all the data at rest is in the bucket, you can pass an
encryption key to turbopuffer that we don't have access to except when we
encrypt and decrypt the objects, and that's audit logged on your side, which is
logically and from a security standpoint equivalent to you having all the data
in your own bucket.
Dmitry Kan [49:42]:
Mm-hmm.
Simon Eskildsen [49:43]:
So this is a very nice primitive that, for example, Linear takes advantage of
because they have full control over their data. They can see when turbopuffer is
accessing it. They can shut it down at any point in time. And they can even pass
that on to their own customers, where turbopuffer encrypts data for Linear's
customers on behalf of the customer, with the customer's key. This is like
really, really, I think, groundbreaking and underrated in this architecture. You
can, of course, do single tenancy with turbopuffer as well, where the compute is
only for you, or you can do BYOC where we run turbopuffer inside of your cloud
in a way that's like very compliant. We can never see customer data, but we find
that the multi-tenancy with the encryption, which can be done per namespace,
satisfies the security requirements of even some of the biggest companies in the
world.
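[Editor's note: one way to get the "customer holds the key" property on object storage is S3's SSE-C, where the key travels with each request and is never stored by the service. This is a sketch of that general mechanism, not a description of how turbopuffer implements its per-namespace encryption.]

```python
import os
import boto3

s3 = boto3.client("s3")
customer_key = os.urandom(32)   # a 256-bit key held by the tenant, not by the service

s3.put_object(
    Bucket="example-bucket",
    Key="tenant-a/namespace-1/segment.bin",
    Body=b"example payload",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,   # S3 encrypts with this key, then discards it
)
obj = s3.get_object(               # reads fail without presenting the same key
    Bucket="example-bucket",
    Key="tenant-a/namespace-1/segment.bin",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,
)
```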
Dmitry Kan [50:29]:
Yeah, that sounds awesome. I also wanted to pick one topic which used to pick up
a lot of flame discussions. I don't know if it still does; I don't see that as
much anymore. What is your recall at n? When I go to the docs of turbopuffer, it
says recall at n is 100%. Recall at n, excuse me, for the vector search part. So
what does that mean?
Simon Eskildsen [50:54]:
Not 100%. We said 90 to 100, right?
Dmitry Kan [50:56]:
No, I think it says, wait, wait, wait, wait, I'll need to... What was the page
where you do that? Oh, here, the limits, I guess.
Simon Eskildsen [51:05]:
Oh, I see observed in production. Yeah, it should say up to 100%. That's a bug
in the docs that I shipped last night. I'm going to fix that after this.
Dmitry Kan [51:13]:
Awesome.
Simon Eskildsen [51:13]:
But what it says in the limits is 90 to 100%. But let's talk about recall. I'd
love to get into recall. So I think recall is incredibly important. You have to
trust your database to get it right in the same way that you have to trust your
database to fsync, and you have to trust your database when we say, hey, we
don't return a success to you unless it's committed to S3. You have to trust
that. Recall is similar, right?
If you are working on search and you're working on connecting data to LLMs, then
you don't want to worry in your evals on whether your vector database is giving
you low recall. It's actually a very sophisticated problem to evaluate whether
this is the cause. So you have to trust your vendor. This is an underrated
problem. And I love that you're asking about it, and very few people ask about
it unless they're quite sophisticated. So let's go into a long answer here for
your audience because I think this is paramount. Most databases that have a
vector index are trained on or not trained on, but they're benchmarked against
for these different ANN open source projects. So there's SIFT and others. The
problem with these data sets is that they do not represent what we see in the
real world. A lot of them are very low dimensionality. Like when we do
benchmarking on a billion that we're working on right now, the biggest data sets
we can find are like 64 dimensions. This is not what people are doing in
production. They're doing at least 512, often generally I'd say the average is
around 768 dimensions. These are not representative data sets. And the
distributions in the academic benchmarks are also completely different, really
different from what we see in real data sets, right? In real data sets, we see
millions of copies of duplicates, right? We see filtering, all these chaotic
environments that do not present themselves in the academic benchmarks. So if
you're using a vector index that's only been tested on academic benchmarks,
it's, I mean, it's like the LLMs, right? You don't really trust it just based on
the scoring. It's all the vibes, right? It's all the qualitative thing, right?
Dmitry Kan [53:21]:
Right.
Simon Eskildsen [53:21]:
Outside of the benchmark, whether it will work for your domain, right? Like with
the LLMs.
Dmitry Kan [53:25]:
That's right.
Simon Eskildsen [53:25]:
Like early on, very, very early on in turbopuffer's history, in the first month,
I was mainly iterating against the SIFT data set, right? Just a 128-dimensional
data set. I didn't know anything about ANN at the time,
just like, okay, this is pretty good. We can tune some specifics on this, and
then I can go wider. But I had a feedback loop. And the observation I had at
the time was that I got something that worked really well, great heuristics, on
SIFT. And then when I went to the other data sets, it just completely did not
work well or generalize to them. And I think
that taught me an early lesson that these academic data sets are just not
enough. And the only way to know what your recall is going to be is to measure
it in production. This is what turbopuffer does. For a percentage of queries, it
depends on the number of queries that you do, but let's say around 1% of
queries, turbopuffer will run an exhaustive search against the ANN index on a
separate worker fleet. We will then emit a metric to Datadog that is the recall
number, right? Like, which is basically, okay, this is the top 10 that we know
is accurate versus the heuristic ANN top 10, and you measure the overlap. And we
will average that over time. I have a graph in Datadog that shows all the
different organizations that have more than 100 queries in the past hour or
whatever. And then we have the recall for all of them. We have the recall at
whatever k they ask for, the @10 recall, the p90 recall, and we try our best to
make sure that this is green at all times. To be considered green, anything
above 90% is generally quite good. Well, 90% is quite good for some queries, but
for simpler queries, often it's closer to 100%. Many of our customers have 99.5%
recall. So this is the only way that we know to do this. And it's funny you
asked this question today because last night I was hacking on putting this into
the dashboard. So literally putting the recall that we observe from this
monitoring system into the dashboard of the user because we think it's that
important and it's very difficult to get right. We have spent thousands of
engineering hours to make sure that the recall is high. Now recall on academic
benchmarks, easy. Recall on raw ANN search, especially on academic benchmarks,
very easy. Raw recall on production data sets, I'd say medium to medium hard.
High recall on ANN queries with filters, with mixed selectivity and incremental
indexing, absolute hard mode. If you just slap a secondary vector index onto an
existing database, this is what they can't do. They can't sustain
like a thousand writes per second with high recall in the face of very difficult
filter queries. So let's talk about filtered recall for a second. There is
barely any academic data sets on this, yet it's all the production workloads.
What a filtered ANN index means is that let's say that, for example, you have an
e-commerce store and you're searching for, I don't know, yellow, right? And you
want to only get things that ship to Canada. That cuts the clusters in different
weird ways that might end up with a selectivity of 50%. And so if you just visit
the closest whatever vectors with some heuristic you have, you're not going to
get the true ANN because you actually have to search maybe twice as many, maybe
three times as many vectors to get the right recall. The query planner, the
thing in the database that decides where to go on disk and figure out the data
and aggregate it all together to return it to the user needs to be aware of the
selectivity of the filter and plan that into the ANN index. Again, if a database
is not really serious about their vector offering, they're not doing this.
They're not measuring it in production. They're not willing to show their users,
and they don't have a full infrastructure in place to measure the recall. So I'd
say we take this extremely seriously, and we don't want our users to have to
guess this. And it's sometimes a thankless job, because many, many, many evals
that we run against some of the other vector indexes show very low recall, and
how are users supposed to know? Because running these tests is extremely
difficult.
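[Editor's note: the production recall check described above boils down to comparing an exhaustive top-k against the ANN top-k and measuring the overlap. A small self-contained sketch; in production the exact side runs on a separate worker fleet over a sample of real queries.]

```python
import numpy as np

def exact_top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> set[int]:
    # Exhaustive search: cosine similarity against every vector, so the result is
    # ground truth rather than a heuristic.
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return set(np.argsort(-sims)[:k].tolist())

def recall_at_k(ann_top_k: list[int], exact: set[int]) -> float:
    return len(set(ann_top_k) & exact) / len(exact)

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 768)).astype(np.float32)
query = rng.normal(size=768).astype(np.float32)
exact = exact_top_k(query, vectors, k=10)

wrong_id = next(i for i in range(len(vectors)) if i not in exact)
ann_result = list(exact)[:9] + [wrong_id]   # pretend the ANN index missed one neighbor
print(recall_at_k(ann_result, exact))       # 0.9
```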
Dmitry Kan [57:59]:
It is, and as you said, you need to have trust there, right? Trust your vendor.
And it's basically, like in some documentation pages you say, the floor, or the
bottom line, right? Beneath which it just doesn't make sense, right? If the
quality isn't there, then why are you even running this? It's the difference
between, you know, finding that product with those constraints when it exists
and actually not finding it, right? And therefore not buying it, and so on and
so forth. It's crucial.
Simon Eskildsen [58:30]:
And I think you can never guarantee a recall. You can observe what you are
trying to make it be on every data set, but if you send a billion completely
random vectors with 3,000 dimensions and then hit them with queries under
10%-selectivity filters, where there is no natural clustering because they're
random vectors, you're not going to get 100% recall. That just completely
breaks every heuristic that's made, right? But all data in production, real data
that people want to search has some natural clustering to it. So that's not a
real benchmark that you can evaluate recall on, right? And so we always take
this seriously, and in POCs and with the monitoring we do, we're looking at
these numbers all the time. But there are like absolute edge cases that can be
very, very difficult. What you also have to do as a database vendor is that it's
a tug of war between "we're going to look at more data to try to get high
recall" and "we're going to try to improve the clustering of the data so that we
have to search less data." And so you're always trying to improve the clustering, and
you're always trying to improve the performance of the database so we can look
at more data to get high recall.
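[Editor's note: a toy version of the filtered-ANN planning idea from a few minutes earlier: if a filter only keeps a fraction of the candidates, the planner has to widen the search proportionally. The scaling rule and cap below are made-up heuristics, purely for illustration.]

```python
def plan_search_breadth(k: int, filter_selectivity: float, max_candidates: int) -> int:
    # Roughly 1/selectivity more candidates must be visited to surface k true
    # neighbors that also pass the filter; cap it so query cost stays bounded.
    assert 0.0 < filter_selectivity <= 1.0
    oversample = 2.0   # a little slack even for unfiltered queries
    return min(int(k * oversample / filter_selectivity), max_candidates)

print(plan_search_breadth(k=10, filter_selectivity=1.0, max_candidates=5_000))   # 20
print(plan_search_breadth(k=10, filter_selectivity=0.5, max_candidates=5_000))   # 40
print(plan_search_breadth(k=10, filter_selectivity=0.01, max_candidates=5_000))  # 2000
```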
Dmitry Kan [59:37]:
Yeah, for sure. And now that you mentioned filtered-search challenges—Big ANN is
another thread. I don't know if you're aware: there's ANN-Benchmarks, right? But
there's also the Big ANN Benchmarks suite that I happen to have had the pleasure
of participating in. One of the tasks they have is the filtered search task. I
have not participated in that one. But again, as you
said, it's kind of like academic, but some of the data sets are quite large, you
know, like billion points, dimensions are not that huge—on the order of a couple
hundred.
Simon Eskildsen [1:00:11]:
That's the thing—they're often only in the 100 to 256 dimension range, not the
512 or 768 you typically see in production.
Dmitry Kan [1:00:13]:
Right. They are real data sets, but they're from the past generation of
vectors—the pre–modern embedding era.
Simon Eskildsen [1:00:35]:
Modern embedding models behave so differently on real workloads. We just don't
see people rely on those older benchmark setups in production.
Dmitry Kan [1:00:35]:
That's right.