Jacob Effron [0:00]:
turbopuffer is an incredibly fast-growing vector database and search engine,
powering some of the leading AI applications out there, like Cursor, Notion,
Linear, and more. Today on Unsupervised Learning, we sat down with their CEO,
Simon Eskildsen, who had a fascinating career prior to turbopuffer, spending a
decade at Shopify working on their hardest infrastructure challenges. We touched
on a lot of things in today's conversation, including how builders should think
about using different databases. We talked about the evolution of the vector
database space over the last few years and why now is a particularly exciting
moment. We hit on Simon's takes on the general AI infrastructure landscape and
what areas will persist. And we also talked about what he's learned from
building so close to the forefront of AI apps today. This is an episode I think
folks are really going to enjoy. Without further ado, here's Simon. Well, Simon,
thanks so much for coming on the podcast. Really excited to have you.
Simon Eskildsen [0:41]:
Yeah, thanks for having me.
Jacob Effron [0:43]:
And get to do it in Ottawa. It's a treat to take the show on the road.
Simon Eskildsen [0:46]:
Yeah, thanks for coming.
Jacob Effron [0:47]:
I was thinking about a few ways we could start. One is it feels like in building
AI applications, everyone is incredibly focused on connecting their
business-specific context to their applications. And I feel like throughout the
last few years, there's been a bunch of different ways folks have started to
think about doing this. One has been, let's just stuff everything we can into a
context window as they get bigger and bigger. Another has been using kind of net
new databases. Talk about the origin of what made you convinced there was a need
for a new search paradigm here.
Simon Eskildsen [1:17]:
Yeah, so I think when I started working on turbopuffer, the context windows were
very small, and so that really got the initial wind in the sails of the first
few vector indexes, right? The context windows were maybe 8k, very small, and so
a lot of the applications that started needed retrieval-augmented generation
(RAG) basically right away. Even most articles are larger than 8k tokens, so
that really got early wind in the sails. But then the context windows
got very large very quickly, right? They came up with these windowed mechanisms
and stuff like that to try to make the context more useful. It's still
quadratic; there's still a lot of tricks. It's not perfect recall at the end of
the day. There are very few datasets that actually do any kind of reasonable
question answering on something the size of a Harry Potter book, so it's
actually very difficult to train a good model to do this. But either way, that
sort of got the initial wind in the sails. But then it felt like the wind kind
of went out of the sails a little bit because the context windows got really
large. And the first few applications that were coming online were able to just
abuse and stuff the context windows and run for product-market fit (PMF). And
that's exactly what I would have done too. But what we're seeing now is that
even with a million-token or 10-million-token context, to say nothing of the latency,
companies want to connect all of their data to LLMs, right? That is the new
workload that we see today. And in order to do that, you often have more than a
million tokens; you have tens of millions of tokens; you have tokens that have
permissions attached to them. You want very high recall, which you can't
necessarily guarantee even in a very large context window.
Again, back to it's very difficult to train a model on this kind of thing,
right? So I think now what we see is that companies want to search large amounts
of data as they have found PMF with the large context windows, but they demand
economics that make sense for them. What counts for them is that the per-user
economics of storing all of this data have to make sense for the value that
they're delivering on the other end. And on a traditional storage architecture,
where you're storing every byte on three disks, the per-gigabyte costs end up
like north of a dollar, right? Per logical gigabyte, and we can do it at a much,
much lower price with object storage and with a set of trade-offs that make
really good sense for search that allows searching on, you know, not a million
tokens but hundreds of millions or billions of tokens for a particular user or
application. I have a little acronym that I throw around when people, when we
talk about content.
Jacob Effron [3:55]:
We love acronyms.
Simon Eskildsen [3:55]:
It goes like this for databases.
Jacob Effron [3:56]:
Yeah.
Simon Eskildsen [3:57]:
I call it SCRAP, and it's basically a checklist of a few different factors; I
just use it because otherwise I forget something. The first S stands for
scale, right? So at a particular point you will outgrow the size of the
window, even if the context window gets to 10 million or 100 million, right?
People are just going to demand more; that's what tends to happen with
computers, right? But there is some sense where, okay, if I'm doing an
analytical query, should I load a billion rows into a quadratic attention
window, or should we have some auxiliary system, right? It feels like even an
AGI-level model would build a system to do some of this stuff right on what's
already there. So that's the first one. The second one is C, so that's cost. The
cost of executing a very large context window is substantial. You have to store
that in VRAM, and VRAM is one of the hottest commodities on Earth right now. It
makes a lot of sense if you're doing a lot of queries against something, right?
Some places will make sense to just have everything in a context window and just
execute against it, and the economics of VRAM make that work. But in a lot of
cases, you can't afford to keep all of this in VRAM, and you can't even really
afford to keep it in DRAM, or sometimes even on local disk, right? Which is why this
architecture of storing things on object storage can make sense. The next one is
recall. So recall in a very large context window can be difficult.
Jacob Effron [5:17]:
Yeah.
Simon Eskildsen [5:17]:
There are a lot of benchmarks that are sort of needle in the haystack, and they
do fine on that because it's easy to train. But actually having datasets that
are good at reasoning over very large contexts, large corpora, and just filling
an answer, there's not a lot of them, at least that I know of. Maybe they're
behind a paywall, so I don't have access to them, but there's not a lot of
datasets for that, and you have to play a lot of tricks to make this work. The
next one is ACLs, right? I don't think anyone trusts the context window enough
that you can stuff in and say, well, you know, Jordan and Jacob have access to
this document, but Simon doesn't, right? Can you just like, you know, pinky
promise that if Simon is the one asking, that you won't do it? There will
probably be a point where we trust them enough, but at present we don't, right?
We sort of like put some UIDs on a document and some UIDs in the query and make
sure there's at least, you know, one overlap in the set intersect to make sure
that the user has access to it. And then the last one is performance, right?
It's difficult on these frontier models to load in a very large context window
and get a response in less than a second, which is what an engaged consumer
would want. So those are some of the reasons that we see among our customers of
why they engage with allowing their LLM to access the data in a different way
than just putting it all in there. But of course, there's some optimal function
between how much goes into context and how much is retrieved into it that's
idiosyncratic to the company.
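The ACL mechanism Simon sketches, UIDs on the document and UIDs in the query with at least one required overlap, is just a set intersection. A minimal Python illustration (all names here are hypothetical, not turbopuffer's API):

```python
# Hypothetical ACL check: a document is visible only if at least one of the
# user's UIDs appears in the document's access-control list.
def user_can_see(doc_acl_uids, user_uids):
    return bool(set(doc_acl_uids) & set(user_uids))

# Filter retrieved documents before they ever reach the context window.
def filter_results(results, user_uids):
    return [doc for doc in results if user_can_see(doc["acl"], user_uids)]

results = [
    {"id": 1, "acl": ["jordan", "jacob"]},          # Simon can't see this one
    {"id": 2, "acl": ["jordan", "jacob", "simon"]},
]
print(filter_results(results, ["simon"]))  # only document 2 comes back
```

The point is that the check happens deterministically in the retrieval layer, rather than trusting the model to honor permissions stuffed into the prompt.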
Jacob Effron [6:39]:
It feels like so much of this is connecting massive amounts of data, and it
feels like in the early days of people looking at some of these net new
databases, you know, they're really making up for the fact that the context
windows were small. And I remember when we were looking at some of these
businesses two years ago, there was just a desire to find like who are the
customers that are connecting massive amounts of data. Maybe two years ago there
weren't that many, honestly, that were, and now it feels like that is actually
starting to change. And you know, you're seeing a bunch of different use cases
on top of your platform with folks that are bringing tons and tons of data to
actually connect into the models where you need something like this.
Simon Eskildsen [7:07]:
I think that's exactly right. In the beginning, the first generation of these
databases and most of the research was really around doing vector search in
memory. And when you have an 8k context window and you're trying to make up for
that with maybe, you know, accessing 32k tokens, the economics are not so bad;
you're not really going to feel the pain yet. Of course, we've done inverted
indexes, which are what you use for a keyword search on disk for a very long
time. But none of this has been tried on something like object storage. But I
think that now, yeah, we're seeing that the frontier really is to try to reason
over very large datasets. And I think I only develop more conviction as I saw
the frontier models start to use so much time querying data, right? They do use
a lot of time searching the public web, and increasingly the models also have
access to private data. I think every company is going to—there's going to be a
baseline expectation that you can fire off some research function, which I think
are the agents that people use today that are the most useful, right? Hey, go
out and compile a report on this.
Jordan Segall [8:24]:
I'm curious on the object storage architecture side. We're big fans of it, by
the way. We just released an InfraRed Report and highlighted turbopuffer as an
example. What are some of the trade-offs of actually using that architecture,
and what are AI workloads that maybe it wouldn't make as much sense for?
Simon Eskildsen [9:21]:
To back up a second here, I think it's important to take the history lesson of
why this is now possible because I think there's a lot of—
Jordan Segall [9:29]:
Love why now as VCs; that's a classic.
Jacob Effron [9:30]:
Classic.
Simon Eskildsen [9:32]:
So I'll entertain that for a second because I think that there's a new storage
architecture here, right? Richie and team did it for WarpStream, right? Where we
started using this for streaming. The observability providers have done this for
a long time. And of course, the OLAP companies that came out in the 2010s have
also done this for a while. So what's changed in the past five years that makes
something like this possible? So I think the first thing is NVMe SSDs being so
good. I think they got good very quickly, and I don't think that any new
database engine has fully been built to take advantage of how much bandwidth you
can drive from these disks, right? If you go to a cloud and you look at the cost
of these NVMe SSDs based on the instance, you can get them for somewhere around
five to eight cents per gigabyte, okay? But you can drive often these disks
within spitting distance of DRAM bandwidth, right? So you might be able to drive
like 10 gigabytes per second to these disks. DRAM on that machine might be 100
gigabytes per second. But they cost two orders of magnitude less, right? A
gigabyte of RAM in the cloud costs somewhere around $5, right? It's like two
orders of magnitude less in cost, but the bandwidth difference is only 10x. This
is a very interesting trade-off, right? And that wasn't really available on the
network storage that's always been available in the cloud. So this performance
is a very interesting characteristic, and you have to write your storage engine
to really take advantage of this, right? io_uring is only something that's come
out recently. You can't even use the Linux page cache because it's too slow to
keep up with these drives. So these only became GA in AWS and GCP and others in
the late 2010s, like I think around mid-2017 on AWS and maybe mid-2016 on GCP,
right? There was like one SKU that had these disks. So that's the first thing.
The second thing is that S3 only became consistent in December of 2020. This is
kind of a mind-blowing fact to me when I learned about it, but what this means
was that if you put an object and then you read it immediately afterwards in
another request, that you know that what you just wrote is what you're reading
back. That only became a core primitive guarantee of S3 in December of 2020.
That's a very nice thing to have if you're building a database on object
storage. You can live without it, right? Snowflake and others have built big
metadata layers so that you can work around this. But it's really nice to have
if you want to write a database fast. The third thing that you need is that you
need compare-and-swap. And what this basically means is that when you're
building a database on top of a file system, right, you have a bunch of metadata
of, hey, the most recent version of this is here. And so you need this metadata,
and you often have multiple writers that are writing to that metadata. In our
case, it would be multiple nodes that might be contending on a piece of
metadata. So you need some synchronization primitive. So past database systems
that are distributed systems will have a Zookeeper or Raft layer or something
like that to contend for this control among multiple writers. But S3 actually
only released compare-and-swap at re:Invent about six months ago (we're
recording this in the summer of 2025). Now, every other object storage actually had it
prior to that, namely GCP, which we started on for exactly that reason because
we had conviction that this was going to become a utility function. Any database
that wasn't built this way would feel archaic in five years, but that was the
last piece of the puzzle. So NVMe in 2017, S3 becoming consistent in 2020, and
then finally compare-and-swap in 2024 allowed you to now build a database where
everything is on object storage. And so I felt that you could build a database
that had these object storage trade-offs that work very well for search. It
doesn't work for everything. So back to your original question, kind of building
up to the answer here. What are the trade-offs of building a database on object
storage? Well, first, it had to be possible. It's now possible because of these
three primitives. And then we have to consider the trade-offs. And so the
trade-off is that every time you write, we have to commit to S3. The p90 for
such a write, depending on the size, is maybe around 100 to 200 milliseconds. So
every time you write, it takes 100 to 200 milliseconds. Now, if I'm writing a
checkout system at Shopify scale, then that's too long, right? We can't wait for
every transaction to commit that long. But if you're building a search engine,
it's usually fine. You update a product, and it takes 100 milliseconds or 200
milliseconds for it to be updated in the search engine. Well, that's an
acceptable trade-off in a lot of cases. Fundamentally, this is really the main
trade-off. Like everything else, I think in terms of cost, in terms of
flexibility, in terms of scale, in terms of simplicity on the system are some of
the upside. You could state another downside would be that occasionally you will
have a cache miss, and you have to read from object storage, but there's nothing
that prevents you from having it always on and hot as you would in a traditional
storage architecture, even on multiple nodes, if a node was to go away. The high
write latency is really the fundamental trade-off of this kind of architecture.
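The compare-and-swap primitive Simon describes can be sketched as an optimistic-concurrency loop over a metadata object. This is a toy in-memory stand-in for the object store (on real S3 the conditional write corresponds to a `PutObject` with an ETag precondition); every name here is hypothetical:

```python
import json

class FakeObjectStore:
    """In-memory stand-in for an object store with conditional writes."""
    def __init__(self):
        self._objects = {}  # key -> (etag, body)
        self._counter = 0

    def get(self, key):
        return self._objects.get(key)  # (etag, body) or None

    def put_if_match(self, key, body, expected_etag):
        """Write only if the current ETag matches (compare-and-swap).
        Returns the new ETag on success, None on a lost race."""
        current = self._objects.get(key)
        current_etag = current[0] if current else None
        if current_etag != expected_etag:
            return None  # another writer committed in between
        self._counter += 1
        new_etag = f"etag-{self._counter}"
        self._objects[key] = (new_etag, body)
        return new_etag

def commit_write(store, key, update_fn):
    """Optimistic-concurrency loop: read the metadata, apply the update,
    and retry if another node committed in between."""
    while True:
        current = store.get(key)
        etag, meta = (current[0], json.loads(current[1])) if current else (None, {})
        new_meta = update_fn(meta)
        if store.put_if_match(key, json.dumps(new_meta), etag) is not None:
            return new_meta

store = FakeObjectStore()
commit_write(store, "ns/meta", lambda m: {**m, "version": m.get("version", 0) + 1})
commit_write(store, "ns/meta", lambda m: {**m, "version": m.get("version", 0) + 1})
```

With this primitive, multiple writer nodes can contend on the same metadata object without a Zookeeper or Raft layer, which is the simplification the conversation is pointing at.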
Jordan Segall [14:48]:
And as you think about the implications of that then for AI app builders and
when they should reach for turbopuffer and when they shouldn't, like how do you
think about the types of things people are trying to do in applications where
this trade-off doesn't make sense?
Simon Eskildsen [14:58]:
Right? So, I mean, turbopuffer is mainly a search engine, first of all, right?
So if you're modeling all of your data and user permissions and all of that, you
should probably use a relational database. Nothing's going to make that go away.
And the transactional performance of that, the flexibility, and all of the
know-how, right? So I'm talking here from a storage engine perspective. Where
something like this architecture makes sense for your search engine, right, is
that if you have a million vectors and you tag on a vector search extension into
your relational database, great. That's what I would do. But if you have tens of
millions, hundreds of millions, you have maybe billions or hundreds of billions
in the crosshairs. Well, at some point, the somewhat idiosyncratic trade-off of
adopting another database and ETLing into it is going to start making sense,
right? Because at some point, you can't escape the economics of, well, if you
use a Postgres extension, you have to replicate it to three disks. And these
disks cost you $0.10 per gigabyte, 50% utilization, all in about $0.60 per
gigabyte that you store. You can't really escape the economics of that. And the
vectors are large, right? At a kilobyte per vector, the text easily turns into 30 kilobytes of
vector data. So where something like turbopuffer makes sense is when you're
searching over large quantities of data, right? And I mean, since the beginning
of time, people have taken out the full-text search workloads from the
relational database. Once it reaches a certain scale, vector search is even more
brutal on these transactional engines in terms of both the cost and how much it
can overload the box, for the same reasons we've taken FTS out for a long
time, so that would be another bottleneck, right? As an application gains
success, you have to start taking out some of these pieces from the relational
database before you have to shard it.
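The replicated-disk arithmetic behind that roughly $0.60 figure is worth making explicit. A quick sketch using the rough numbers from the conversation (the inputs are the speaker's ballpark figures, not quoted prices):

```python
# Back-of-the-envelope cost per logical gigabyte of triplicated disk storage,
# using the rough figures from the conversation (assumptions, not quoted prices).
disk_cost_per_gb = 0.10   # $/GB for the underlying disk
replicas = 3              # every byte is stored on three disks
utilization = 0.50        # disks typically run about half full

cost_per_logical_gb = disk_cost_per_gb * replicas / utilization
print(f"${cost_per_logical_gb:.2f} per logical GB")  # $0.60 per logical GB
```

Compare that against object storage at a few cents per gigabyte and the gap the conversation keeps returning to becomes obvious.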
Jordan Segall [16:47]:
What are some of the most common set of use cases you are seeing people build on
top of turbopuffer? I mean, is that some of these deep research-type use cases
on top of tons of data or like as varied as the landscape is out there?
Simon Eskildsen [16:57]:
I think, I mean, we could go through some of our customers and how they use it.
So Cursor was actually the first paying customer on turbopuffer.
Jordan Segall [17:07]:
Pretty good first customer.
Simon Eskildsen [17:07]:
Yeah, seriously. And they've been wonderful to work with and true design
partners in every sense of the word. And their use case was to—they want their
agents and all of their features to be able to do semantic search over a code
base, right, or over multiple code bases. And that's what they use turbopuffer
for. So if you've opened a code base in Cursor, then that code base is indexed
into turbopuffer, right, like into vectors that are completely obfuscated and
encrypted. And then they can use RAG over all of that to try to draw in more
context. I use it all the time because, you know, you can ask in the chat, you
can ask it something like, "Hey, what's the function that does this thing?" Like
this morning I was asking, "What's the function we have that formats a number so
it's not a million, like, you know, five zeros after or whatever?" And you
just ask that in free text because you don't remember what it's called.
That's the kind of thing turbopuffer is really good at, right? And
you will see the Cursor agents make these kinds of queries. So that's one,
right? So they're connecting code bases to AI, sometimes very, very large code
bases or multiple code bases. Notion is another one of our customers, so they
have a Q&A feature, and increasingly this is making it into more and more of
their canvases where you can ask it, "Hey, what's the leave policy?" or "Hey,
someone passed in my family, like what are my options?" It's used for an
internal wiki; it can do research-type things. So they use it for that. And
often the way that a person thinks about something and the way that it's written
in a document are different, right? Like you search for "red dress" and they
have a "burgundy skirt," like that's the kind of thing vectors are really good
at. And so Notion uses that to draw context into the LLM to answer questions.
Linear is another customer of turbopuffer. They use it for their search, and
they also use it for similarity. So, "Oh, this might be a duplicate issue of
this one," or "This might be the person who should work on this type of issue."
I think they might also be using vector embeddings for that. So those would be
some of the use cases that we've seen.
Jordan Segall [19:13]:
Here's, you know, we talked about why traditional databases aren't a good fit
for sort of the SCRAP model that you talked about. What about incumbents like
Elastic sort of going after and combining vector search with traditional search
like you are?
Simon Eskildsen [19:25]:
If your per-user economics allows you to store everything in memory, then, you
know, fine. That would make sense. Or on disks. I think traditional storage
architectures, you know, they have lower write latency and they may have maybe
more features, right, because they've been at it for longer. I think it makes
sense for certain scale. But if you have the ambition to search potentially
trillions of documents or billions of documents, then the cost might not make
sense for your application, right? And you might have to start upping the price
charges on your users in order to have this available, and you know your market
better than us, right? And we just try to price against the first principle cost
to us.
Jordan Segall [20:08]:
You know, I think you're one of the world's experts now in building object
storage architectures. Any tips on just building that out for folks that are
looking to?
Simon Eskildsen [20:16]:
One of the things that as an engineer continues to surprise me is how far
simplicity goes. Now we have very high conviction that keeping everything on
object stores, including the metadata, was the right decision. But really, our
hand was forced a little bit early on because we had some customers that were
growing very fast, and we didn't have time to introduce a separate metadata
layer. And we thought that maybe people would care a lot about very low write
latency. But again, we gained high conviction that this is actually fine. And
the trade-offs and the simplicity that we get out of committing directly to S3
were there. So I would say, let object storage surprise you. I think there's a
big bag of tricks that you can start leveraging to build really scalable
systems on top. The other day, for example, of course, we have many millions of
prefixes on S3, and we sometimes have to go out and do various background
activities on them, so then you have to go list them, right? The S3 API is like
you start from A and then you list, and then you go and paginate through, and it
would just take forever because for S3, this I think is sort of a brutal
operation for them, so it takes a really long time. And so what we started
doing was to just have a read-through cache where every few pages we would put
in a text file what that prefix was, and now we can start listing the bucket
at all of these different keys in parallel. So there's
lots of these tricks that you can use, but, like, I don't think I can explain
to you how many hours I've sat and just looked at the S3 API to try and come up
with something, right? Like another thing I'll say is like 404s are free, and
you can use that to design systems, right? There's the compare-and-swap
primitive. There's all of these little headers that you might not have paid
attention to that allows you to build a system like turbopuffer with very, very
low latency and very high durability. So I think we have a big bag of tricks. At
some point, we should lay out the 16 tricks in the bag. There's nothing secret
there other than just spending a lot of time looking at the S3 APIs and the GCS
APIs.
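The prefix-listing trick, caching checkpoint keys so a huge flat namespace can be listed in parallel segments, can be sketched against an in-memory key list; on real S3 each segment would map to paginated `ListObjectsV2` calls with `StartAfter` set to the checkpoint. Everything here is a hypothetical illustration:

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

# A sorted flat namespace standing in for the keys in an S3 bucket.
ALL_KEYS = sorted(f"ns/{i:06d}/index.bin" for i in range(10_000))

def list_segment(start_after, end_at):
    """Stand-in for paginated ListObjectsV2: returns keys strictly after
    start_after, up to and including end_at (None means unbounded)."""
    lo = bisect.bisect_right(ALL_KEYS, start_after) if start_after else 0
    hi = bisect.bisect_right(ALL_KEYS, end_at) if end_at else len(ALL_KEYS)
    return ALL_KEYS[lo:hi]

# Checkpoint keys recorded during a previous crawl (the "text file" cache);
# each one lets a later listing start mid-namespace instead of at the front.
checkpoints = ALL_KEYS[2500::2500]  # e.g. every 2,500th key

def parallel_list(checkpoints):
    bounds = [None] + checkpoints + [None]  # None = start/end of bucket
    with ThreadPoolExecutor() as pool:
        segments = pool.map(lambda se: list_segment(*se),
                            zip(bounds[:-1], bounds[1:]))
    return [key for seg in segments for key in seg]

assert parallel_list(checkpoints) == ALL_KEYS  # segments cover every key once
```

Instead of one sequential crawl from the start of the bucket, the cached checkpoints let each worker list its own slice concurrently, which is where the speedup comes from.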
Jordan Segall [22:27]:
And then we talked about sort of use cases that are a good fit for turbopuffer
and not as much of a good fit. What do you think are sort of unsolved problems
today within vector search?
Simon Eskildsen [22:35]:
So the hardest thing about vector search is keeping the index up to date. So
when you do vector search, the only way that you can guarantee that when you do
this query that the 10 results you're getting back are exactly the 10 results
that are closest to this vector is actually by looking at every single vector,
right? It's an unsolved problem to have a non-O(n) algorithm that can do that.
So we approximate, and we approximate with a number that the listeners may have
heard of referred to as recall. Recall is basically, okay, these are the actual
results I got back from my vector index, and here are the exact results. What's
the percentage overlap? So if one of the ten results is wrong compared to the
exhaustive search, then you have 90% recall; if all ten match, you have 100% recall. We
find that our customers feel really good at around 95% recall. That's sort of
when you don't have to guess whether your evals are not performing because of
your retrieval layer. And so in the background, for a percentage of queries,
we run exhaustive searches against them and report the recall back to
Datadog. So we always have an idea of what's actually going on in production.
When you incrementally maintain this index, you have to make sure that the
recall is good. And if you built the index, you know, knowing that, okay, I have
an e-commerce store and they're selling pants and they're selling dresses and
they're selling shorts, and suddenly they start selling shoes, well, which—like,
how do we put this out in the vector index? Because the vector index sort of has
these clusters around different things. And so maintaining high recall as you
continuously update the data is very challenging, very, very challenging. In the
beginning, in the first version of turbopuffer, once a certain percentage of the
dataset had been overwritten, we would rebuild the entire index. That's what most
production implementations do today. But we found this very, very challenging
and expensive to scale. So we spent a lot of time implementing an algorithm
that will incrementally maintain this ANN index, and it now works into the
hundreds of millions of vectors and even into the billions. We're trying to
push incremental maintenance of that ANN index with high recall as far as we
can, because sharding is the only way to scale anything, but sharding too
early is a cop-out, right? And on some of the
traditional search engines, the shard sizes are maybe around 50 gigabytes, and
you know, a lookup operation is log n or whatever it might be. But if
you have to do log n for a thousand shards, it's a lot more expensive than doing
log n on five shards. So you want to make these shards as large as possible, and
so incrementally maintaining high recall on a vector index at high ingestion
performance in the hundreds of millions or billions for a single shard is a very
challenging problem. The second challenging problem around vector search is
filtering. So when you send a vector query, you are not just saying, "Hey,
what's close to red dress?" Oh, it's the burgundy skirt. You're often like,
"Does it ship to Canada? Is it red?" Like, you know, you have all these like
real filters, right?
Jordan Segall [25:51]:
I love all these Shopify examples, but okay.
Simon Eskildsen [25:53]:
It's just, you know, I—
Jordan Segall [25:55]:
DNA at this point.
Simon Eskildsen [25:56]:
I grew up right here in Ottawa, which is where the company was founded and the
infrastructure was built, and it shapes your worldview. But anyway, so you do
these like actual hard filters on top of this more fuzzy vector search, and that
can be challenging because that's a different type of index that you need to use
for that. And you can use all kinds of techniques to make sure that the recall
is still high in the face of these filters because, of course, you know, if you
search for something that's the color red, that might match a small percentage
of the dataset, and you can just search all of that, so that's very easy. It's
called a pre-filter. If something is—if you're checking for whether a product is
public, well, probably 99% of them are public, so you just over-fetch a little
bit and then you remove the others. But something like "ships to Canada," which
maybe applies to 50%, that randomizes around the clusters. And if you're
trying to get a banana that ships to Canada, there might not be any produce
that ships, because that would be prohibitive. So the clusters closest to the
banana are just completely off, and what it ends up matching is maybe one of
the clothing clusters that's, you know, fruit-themed. And so providing high
recall in the face of this really requires that
the query planner that decides how to execute the query is aware of the
filtering and is aware of the vectors. These are the two hardest problems. And I
mean, like, capital H hard.
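Recall as defined here, the percentage overlap between the approximate results and an exhaustive brute-force search, is easy to compute directly. A self-contained toy sketch with random vectors (an illustration of the metric, not turbopuffer's implementation):

```python
import math
import random

random.seed(0)
dim, n, k = 8, 1000, 10
dataset = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
query = [random.gauss(0, 1) for _ in range(dim)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Exhaustive ("ground truth") top-k: an O(n) scan over every vector,
# the only way to get exact nearest neighbors.
exact_topk = set(sorted(range(n), key=lambda i: dist(dataset[i], query))[:k])

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate index returned."""
    return len(set(approx_ids) & exact_ids) / len(exact_ids)

# An ANN index that returns 9 of the 10 true neighbors scores 90% recall.
wrong = next(i for i in range(n) if i not in exact_topk)
approx = list(exact_topk)[:9] + [wrong]
print(recall_at_k(approx, exact_topk))  # 0.9
```

Running this kind of exhaustive check in the background on a sample of live queries is exactly the production monitoring the conversation describes.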
Jacob Effron [27:19]:
I mean, one thing I'm struck by, and obviously there's just so much nuance in
solving this problem, is you think about this like zooming out this general
problem of connecting data to LLMs. You know, there's a bunch of parts of that
process that all have their own problems, right? There's like getting the data,
you know, to the database in a way that works. There's obviously the embedding
models themselves and how people do that. And I'm sure some people building in
this space have said, "Okay, like, you know, customers like to have a one-stop
shop, or we should, you know, build more of these things." It feels like you've
been super focused on really just nailing this part of that entire process. Like
talk a little bit about how you think about that and you think about like the
future of turbopuffer across those vectors.
Simon Eskildsen [27:53]:
We've talked about simplicity a bunch. It's a very core cultural value, I think,
both of the company and of Justine Li and me. Anyway, so simplicity and focus go
hand in hand, and we felt that the hardest problem that our customer was having
was not choosing the embedding model; it was not running the embedding model; it
was not running the re-ranker; it was, "Hey, I need to store petabytes of data,
and I need to search it." And so that's what we're focused on now. Over time,
what we're starting to hear from our customers is, "Hey, for me to ingest 100
million vectors very quickly into turbopuffer, it would be really great if you
could help us with that," right? So we wouldn't close the door; we're by no
means saying we will never do it, but we are very focused on just this because as soon
as you have to start running the embedding model, then it's like, "Okay, for low
latency, you might want to run it on GPU." Now you have to sort of get into that
game and sort of the quant GPU game, right, that rules today's world. So we want
to make sure that the things that we put our name behind, we're doing a really,
really good job. I take the responsibility and Justine also takes the
responsibility very seriously of people trusting us with their data. And we know
that if we mess up, we're not the only ones that get woken up; our customers
also get woken up. And every single decision that we've made around how we've
designed this has been that we don't want to get woken up. I was on the last
resort pager of Shopify for almost a decade, right? I have sat there at 3 a.m.
debugging databases completely alone in the dark so many times. We know what
that feels like, and we don't want our customers to be in that position. So
reliability is number one. And if you put reliability as number one, then you
have to put simplicity right up there as number two. And then when you start to
bring in all of these other things, you better have gotten that right on the
core product. So we take that very, very seriously, and it's part of—it informs
everything around how we do it, right? The only stateful dependency that
turbopuffer has is object storage, right? You can blow away all the nodes, and
all of the data is safe, right? You can blow away everything, and things are
fine, right? And routinely that happens in production, that things are blown
away and autoscale down, whatever, and everything is fine, right? And no one
notices.
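The property Simon describes can be sketched with a toy example (purely hypothetical classes, not turbopuffer's code): object storage is the only stateful dependency, every write is durable in the store before it is acknowledged, and the node's DRAM/SSD tiers are purely caches, so a node can be blown away at any time and rebuilt with no data loss.

```python
class ObjectStore:
    """Stand-in for S3/GCS: the only stateful dependency."""
    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects.get(key)

class Node:
    """Stateless query node: its local state is purely a cache."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def write(self, key, data):
        self.store.put(key, data)  # durable before acknowledging
        self.cache[key] = data

    def read(self, key):
        if key not in self.cache:  # cache miss: refill from the store
            self.cache[key] = self.store.get(key)
        return self.cache[key]

store = ObjectStore()
node = Node(store)
node.write("doc/1", b"hello")

node = Node(store)  # "blow away" the node entirely
assert node.read("doc/1") == b"hello"  # data survives in object storage
```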
Jacob Effron [30:26]:
Obviously, there's ingestion and embedding. Is there other stuff you think about
that exists in this set of problems, that customers are coming to you and asking
about, or that you eventually might want to add in?
Simon Eskildsen [30:37]:
Yeah, I mean, we'd like to just see what our customers actually use out there,
right? It's like hard enough to find PMF on one product. So let's make sure that
we double down on what's working. We see our customers asking us for advice on
how they should do evals on search results. Our customers ask us, "Which
embedding model should we use?" And we have internal reports, right, on which
ones are fast, and from what we hear we can pattern match, in a respectful way,
across what we see our customers doing. There are some embedding models from
some of the labs that take 300 milliseconds to embed a query; that's prohibitive
for some search; that's too long. If turbopuffer takes 10 milliseconds but it
takes 300 milliseconds to create the vector, that's not acceptable. So we want
people to use fast embedding models so that they don't get painted into a corner.
Rerankers, the same thing, right? I mean, I worked on search at Shopify, and we
see what others do in search here. And so we just help our customers. But in
general, what I have seen and saw at Shopify as well is that in the traditional
search engines, you end up with a massive DSL where you're expressing, "Hey,
multiply the title, BM25 results with this and that and this field and then a
little bit of the image," and it sort of becomes this very finely massaged
thing. And generally, someone loaded all that context into their head, executed,
got good evals, and then no one touches it for years, because now you're
addicted to what happens when you type in a set of characters and navigate. Have
you ever had the search engine change on you? You realize how bought in you are
to it. But it's very difficult to maintain; it's thousands of lines of JSON or
whatever. And so I think right now what we're seeing is that vectors make up for
a lot of that. Again, coming back to this red dress, burgundy skirt example: we
used to spend a lot of effort and a lot of PhDs turning strings into things, but
that's just, you know, you cut the head off of the LLM, right? And these numbers
are exactly that. So we find that a lot of these features are not needed, and we
find that our customers actually really like to just write a search.py or
search.ts or whatever they're using and do a bit more of this themselves. There
might be, you know, one or two milliseconds of performance penalty, but
fundamentally there's really not much, and you gain control. You can write
tests. It's easy to write evals. So as we find that our customers want more of
this, and in particular things where, from first principles, it makes sense for
that to be in the engine for performance reasons, then we
will do that. Like something we're starting to see is that people want to do
late interaction where you're often issuing something like 128 vector queries in
one search query. And so that's maybe a little bit too much to funnel down over
JSON. So we have to help with some APIs around that. So we're always paying very
close attention to it, but we take the same stance as we always have. You know,
European cities are beautiful because they're built incrementally, and software
is really the same thing. When you start to guess too much about what people
might do with the software, you end up with, well, we're in Ottawa; it's not a
particularly beautiful downtown, because a lot of things needed to happen very
fast, and so you build a lot very quickly. And I think a lot of bad software is
written that way.
You make too many assumptions about how people are using it, right? Look at us:
we didn't think that people were going to love the 100 to 200 millisecond write
latency, but it turned out to be fine.
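The search.py pattern Simon describes, where the application fuses a vector result list and a keyword (BM25) result list itself instead of maintaining a giant ranking DSL, can be sketched with reciprocal rank fusion, one common choice for this; the function and inputs here are hypothetical, not a turbopuffer API:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of document ids; a document scores
    1 / (k + rank) for each list it appears in."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. one list from a vector query, one from a BM25 query
vector_hits = ["a", "b", "c"]
keyword_hits = ["a", "c", "d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because the fusion lives in application code, it is trivial to unit test and eval, which is exactly the control Simon says customers like about writing this themselves.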
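The late-interaction pattern he mentions, roughly ColBERT-style, issues one nearest-neighbor query per query-token vector (the ~128 queries) and then rescores candidate documents with MaxSim. A minimal sketch of the scoring step, under those assumptions:

```python
def late_interaction_score(query_vecs, doc_vecs):
    """MaxSim: each query token vector takes its best match over the
    document's token vectors; the per-token maxima are summed."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Identical one-hot token vectors: each query token matches perfectly.
score = late_interaction_score([[1, 0], [0, 1]], [[1, 0], [0, 1]])
```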
Jacob Effron [34:08]:
Does this approach of focus and gradual building lead you more to the Cursors
and Notions of the world? One thing I've been struck by, seeing infrastructure
players sell into large enterprises, is I feel like they'll come with, "We do
this part of your broader RAG solution," and the enterprises are like, "No, no,
I want one thing to do it end to end for me today." And in talking to a lot of
infrastructure companies, I feel like they get dragged into other parts of the
stack because that's what folks want to buy.
Simon Eskildsen [34:34]:
Yeah, I think naturally, right, everyone starts to bundle at some point; you
start to see commonalities between your customers, and you try to help them get
there faster. But you can also get too greedy, and you can start
to do too much of this too soon at the cost of the focus of the team, right?
What made it work in the first place was that you even have customers asking you
for this, which is a blessing. So we think about it, and I expect us to like
partner and bless and work on things to make the end-to-end process much easier
for our customers. And yeah, it certainly takes a lot of discipline sometimes to
say no and then say yes to working on more performance, more cost reduction,
whatever it might be, right, on the core things that we want to get good at.
There's a grab bag of ideas, and it's rare that a new idea enters the grab bag
that we haven't already spent considerable time and effort thinking about. What
is always difficult is deciding when to pull something out of the grab bag, and
continuously doing that Bayesian, you know, gradient descent on, okay, what
matters this month, right? Continuously updating the priors based on what the
market and customers are demanding, right? And not saying no forever. We do
exactly that process internally as well.
Jacob Effron [36:33]:
I think one of the hardest parts of building AI infrastructure companies
generally is that, with the underlying model layer changing so fast, the way
people use these models evolves so fast too. And so it feels
like you're playing a bit of whack-a-mole of like, "Okay, we just solved for the
way people are building things today," and then in three months, they're
actually building things in a very different way now. I think, you know,
arguably the space you're building in might be the most isolated from that
because it seems like at any capability of model, this will be required. But I'm
curious as you zoom out and think about other parts of the AI infrastructure
stack, does that resonate? And how do you think about which parts of it will be
persistent versus, you know, things that seem more moment in time?
Simon Eskildsen [36:33]:
I like companies that have state. And that's why we built a stateful company. I
think there would be a lot of commonalities, right, in what would be in there.
And I think that to build a good company as part of the AI stack, you either
want to come up with a really good workflow, and lots of good companies have
been built around workflow, or you want to capture state. Generally, workflow
companies eventually start capturing some state as well, and the stateful
companies also start capturing workflow. I think that if you try to do all of
that at once, you need a lot of years of R&D in a lab before you go to market.
And I don't think anyone has time for that right now. So I think it's a
very interesting time to be an infrastructure company because I think that among
the frontier, they are picking and choosing and doing the Lego thing, and
they're being very careful about what they're choosing, and they want the best
in the world for every particular piece, right? And so then there will also be
companies that try to bundle everything, but it will come at some quality
trade-off for individual pieces. This market is different, right? I feel like
there's always some demarcation between generations of companies. It was very
clear that a new era of companies started in late 2022, right? It was clear that
during that era, there was a particular set of companies. In the 2010s, which is
where I feel like I grew up in the software world, you had the Stripes and the
Shopifys and the GitHubs, Zendesk, and Pinterest, and so on, companies that in
some ways felt similar but were also, you know, wonderfully different. And now
we're in yet another one, and I think that
it has various attributes, but I think that we are seeing that some of the ones
that are doing very well are very specialized on what they're doing, right?
They're taking the niche that they're good at and they're doubling down on it.
At some point, we will see them bundle, but it might take a little longer than
it did for the previous generation of companies because all the great
generational companies have a very strong finger on the pulse of what's
happening and what people want right now, and it's very informed by the
customers.
Jacob Effron [38:47]:
One area that's getting a lot of interest now is memory in the AI infrastructure
stack. What do you think the future of memory is? And then what is the
role of turbopuffer, you think, in that space?
Simon Eskildsen [38:47]:
Yeah, I think with memory, people will always start by playing around with the
simplest thing, right? So if you're using agents today that do memory, there's
the memory within the context of a very long encounter, right? If you're working
with a coding agent, you might be working with it for a long time, and at some
point it has to compact. Right now, for that kind of thing, a lot of people just
use either a text file or just ask the LLM to compact. And that's where it's
going to start. Then we
see them start to bring in memories sort of laterally, east-west to the chat
itself. You know, my ChatGPT is very confused because I share it with my wife,
and so it's generated all these memories, right, of how do you grow this flower
and what do you do about this pest, which are, you know, my wife's. And then it
will draw them in: "Oh, since you're a good Rust programmer, here's a script to
get a weather and humidity report for your flower." So now we have to split the
account. But anyway, those memories are lateral, and it's not a lot of data,
right? So even if you just pulled them all into memory and did some similarity,
or pulled them all into context, and they're condensed enough, that's probably
fine. Are we going to
start seeing memory at a scale where you have to start doing a lot of RAG over
it? I think that we do see some of it. So one of our customers is this company
called Portola, and they built Tolan, and Tolan has very long-standing
conversations with their users. And these are not just memories, right? They're
long, long chats. So there's also this slider between what's a memory versus
just searching all prior context. Similar to what you've heard me say before, we
haven't seen enough patterns emerge here among our customers to say there's
anything in particular that we have to ship, right? You could also just put all
of this into turbopuffer; it's just like a key-value store on object storage,
and it will work great. You don't need to use vector search; you could just use
the keyword search as well. But I think it's TBD exactly what this looks like.
In some implementations it's not a lot of data, and in some it is a lot of data,
and it's more over the history. I think I
would assume that those that do it over the entire history will outperform those
that just condense into memories. But the memories sort of have a higher weight
because they're condensed from a chat. I wouldn't pretend to know exactly where
that's going to go, but we are seeing a lot of experiments in that area.
Jacob Effron [41:24]:
How much time do you spend thinking about where the foundation models are going?
Obviously, the future size of these models and the way they interact with memory
and databases have huge implications for your business, though they're probably
unknowable to some extent today.
Simon Eskildsen [41:40]:
I do spend a lot of time thinking about it. I don't think I have a general
answer to just, "This is why it's fine." I think it's easiest to think sometimes
in extremes. Okay, we get AGI: well, none of this matters anyway, so that's
fine. Great. Everything we have now will just compound, and everything's great.
Then there's the other scenario,
right, which is I think the one that we're rapidly entering now where the models
are incredibly powerful. They're very good at generating reports over large
amounts of data, and it's very clear that even as they get very capable, you've
got to yank a pipe into them and put something computationally good on the other
side. And I think the architecture could look a lot like turbopuffer, because
there's a lot of data, you don't need all of it at all times, and it needs some
kind of fuzzy search a lot of the time. I
think that there's a real role for a database with this architecture. I think
that to build a good database company, you need two things: a new workload,
which is connecting large amounts of data to LLMs, and a new storage
architecture. And we talked all about that before, right? And
why now for it. And so I think our hypothesis is that this will play a role in
how these LLMs are going to interact with data, right? But we also see lots of
people who just use turbopuffer for traditional search, right?
Jacob Effron [42:52]:
Before we started recording, you said there would be this new set of things that
will be table stakes for SaaS applications. And I imagine one thing you think
about a lot is this flowering of tons of different applications that will need
hundreds of millions of vectors and will build on top of that. How do you think
about the addressable universe of what those companies might look like?
Simon Eskildsen [43:13]:
When I think about what I want out of some of the applications I use, today I'm
just like, "Oh, I just talked with someone about this over here." A lot of what
knowledge workers do has to do with funneling context around between different
systems, right? So it'd be great if applications could help with that, and I
think we could help them help people with that. I think there's a bunch of
features that are now going to
be table stakes for SaaS in the same way that once mobile really hit, it became
table stakes that everyone had a good mobile app. And it felt like a huge tax at
the time, right? I have to bring in another programming language; I have to do
this. And there's all of these things that were happening around the time around
like, "Do we build them natively? Do we build them as web?" And now, you know,
you just don't think about it when you build a software company that you're
serious about; you just kind of need a mobile app, right? For most of them, not
all, but for most. And I think that that's also what we're seeing with AI,
right? There is now a set of table stakes features that people expect in your
application, in the same way they expect to find your application by name in the
app store. Those features are things like semantic search that works, right? If
I search in my Linear for chat, issues tied to Slack also come up, right? You
know, if I search in a commerce store for burgundy, you know, whatever, right?
Jacob Effron [44:24]:
Yeah.
Simon Eskildsen [44:24]:
If I search for a shoe and they only have sneakers.
Jacob Effron [44:37]:
This must have been the most important eval at Shopify.
Simon Eskildsen [44:43]:
I think it's just one that I keep coming back to. I think semantic search is
table stakes, right? And I think it works great as a byproduct of the LLMs. The
second one is similarity: people are just expecting this deduping and, "Oh yeah,
there's something similar to this over here." You could also call it
recommendations; it's, you know, a rose by any other name. The third thing is
the ability to generate a report: ask a question, and it goes and finds a bunch
of information and queries the data. Then you also want some agentic workflows,
right? Cleaning things up, taking actions for you, and all of that. And the
agentic workflows probably want some of one and two and maybe also three to get
done. And there's
probably others that are idiosyncratic to the particular application. But I
think all of those are becoming table stakes AI features in SaaS. And I think
we're seeing that the incumbent SaaS providers are doing a phenomenal job at
building these in and really prioritizing it, and I think that the
upstarts—there's a massive opportunity to try to get some of these like really,
really right and build interfaces that are native around them. But that's how I
think about the AI era SaaS, yeah.
Jacob Effron [45:54]:
Yeah. What do you see with multimodal data? Is most of the usage of turbopuffer
today text-based? And where do you see that going?
Simon Eskildsen [45:59]:
Yeah, I mean, it's completely possible to do something multimodal in
turbopuffer. Again, I look at what the market is doing, not what it's saying.
And we don't see that many companies yet doing multimodal, like over images,
over attachments, and all of that. Usually, the implementations lag a little bit
behind, right? But I think it's great, and the economics of object storage make
it really, really nice to embed both the picture of the product and the
description of the product and all kinds of other attributes around what you're
searching. You know, the economics of turbopuffer might allow you to just embed
all the PDFs and not think too much about what it's going to cost you. That's
otherwise been scary, because, okay, someone just uploaded a 2,000-page
PowerPoint presentation; are we just going to embed that and not charge them
extra? You don't expect all your SaaS providers to start doing usage-based
pricing, right?
Jacob Effron [46:56]:
Yeah. Well, we always like to end interviews with a standard set of quickfire
questions where we basically just cram in all the questions that we didn't have
time to hit in the regular interview. What company do you think would be most
interesting to run AI at?
Simon Eskildsen [47:06]:
I mean, it would be one of the frontier labs, right? It would be like OpenAI or
Anthropic, one of the ones that are seeing the models three or six months out.
Jacob Effron [47:15]:
Where's the name turbopuffer come from?
Simon Eskildsen [47:17]:
Do you want the real reason or do you want the marketing reason?
Jacob Effron [47:22]:
Definitely the real reason.
Simon Eskildsen [47:23]:
The real reason was that it made me happy, sounded funny, and it had an emoji
that had no other real meaning.
Jacob Effron [47:31]:
That is a good emoji.
Jacob Effron [47:32]:
And then how have you turned that into great marketing?
Simon Eskildsen [47:32]:
When the pufferfish is deflated, it's on object storage, and as it expands all
the way into battle stance, it's in DRAM, right? And SSD in between.
Jacob Effron [47:43]:
Nice.
Simon Eskildsen [47:44]:
You must love—
Jacob Effron [47:44]:
You must have been very proud of yourself when you came up with that.
Simon Eskildsen [47:46]:
Yeah, you know, maybe a little bit, but it was not the original intent of the
name.
Jacob Effron [47:53]:
What's one thing you've changed your mind on in AI in the last year?
Simon Eskildsen [47:57]:
You know, on AI, I spend most of my time still thinking about databases.
Jacob Effron [48:02]:
Yeah.
Simon Eskildsen [48:04]:
And I think the biggest thing that I've changed my mind on in databases is I
just keep being surprised that this simple thing continues to work. And it's not
a great answer because it's not a good gotcha, but that would be my answer.
Jacob Effron [48:17]:
Yeah. What do you think is the biggest mistake you've made so far in running
turbopuffer? Or something you look back on from a few years ago where you're
like, "I wish we'd learned that lesson earlier"?
Simon Eskildsen [48:25]:
I feel like, on the product at least, we haven't committed any major mistakes
yet. And I think people sometimes underestimate how hard it is to run product
early at a startup. But the first few customers that we had used every single
feature of the product, and there was not a line of code that wasn't being run
in production. If I thought about it a bit harder, I could come up with a better
answer for you, because we definitely made a million mistakes. But I think a lot
of it is the survivorship bias of getting the product right.
Jacob Effron [48:58]:
What was something you learned as a founder?
Simon Eskildsen [49:00]:
I get a lot of people who tell me what they think that I should do. And—
Jacob Effron [49:04]:
VCs.
Simon Eskildsen [49:04]:
Yeah, especially VCs, you know. And I've really learned to trust my instincts.
When we talk as a team and we have a feeling about something, just giving
everyone permission to say, "Okay, let's just try it," has worked great.
Everyone says you should do embedding, you should do re-ranking, and all of
that. It doesn't quite feel right yet. The vibes have to be right.
Jacob Effron [49:36]:
Yeah, it's always about the vibes. I assume you think a lot about questions
about the future of where AI is going. If you could talk to someone from the
future and get one question answered that would help you in building today, what
would the question be?
Simon Eskildsen [49:51]:
How much is the agent searching?
Jacob Effron [49:55]:
The extent of vectors that the agent has to go through.
Simon Eskildsen [49:58]:
Not just vectors, but like how much is it utilizing a search engine, right? Like
it's very clear that you're not going to do web search by loading that entire
thing into context, right? How much are they searching?
Jacob Effron [50:09]:
You have an interesting story around how you learned to code. First you started
with online PHP resources, then you took a break because there weren't any more
resources; you played a lot of World of Warcraft, which I was a big fan of, by
the way, and you learned English from that; and then you got back into coding.
How do you think LLMs will change how people learn to code, and what do you
think the future of software engineering will be with LLMs?
Simon Eskildsen [50:31]:
There's nothing I would have loved more than an LLM to talk to when I was 11
trying to learn how to program; the Danish web on web programming was just too
small. I mourn for my younger self not having had an LLM to learn with. I think
about that a lot, and I think about it in the context of my daughter, and how a
curious child now can just get access to so much in such an accessible form.
That brings me a lot of joy.
Jacob Effron [51:02]:
That's awesome. Well, this has been a fascinating conversation. I'm sure folks
will want to pull on all sorts of different threads. I want to leave the final
word to you. Where can folks go to learn more about you, turbopuffer? The floor
is yours.
Simon Eskildsen [51:12]:
Yeah, so turbopuffer.com to learn more about the database, its trade-offs, what
it costs, and everything along those lines. I mostly post on X, at
x.com/sirupsen. turbopuffer is also on there, but turbopuffer.com is the best
way, and on X, yeah.
Jacob Effron [51:29]:
Amazing. Well, thanks so much. This is a ton of fun.
Simon Eskildsen [51:31]:
Thank you so much for having me.