Query prices reduced by up to 94%

Billion-scale vector storage for RAG

November 04, 2025 · Jason Liu Podcast

Transcript

Jason Liu [0:04]:
So, you know, I think one thing that people always talk about is doing vector search, right, doing vector search at scale. And when I think of, you know, databases that are truly being used in production, I mostly think of turbopuffer, right? They're the tool that Cursor, Linear, and Notion all use in their backends to do search. And, you know, in conversations with those folks, it's always been the case that whatever provider they were using before was very expensive, you know, very slow. I think in the past year or two, people have realized that turbopuffer has been a great solution, especially because it's backed by object storage, right? It's a tool that gives you full-text search, vector search, and, as we talked about earlier, filtering and aggregations. And these are all the ingredients you need to do search really well in the context of RAG, but I also think a lot is going to happen in the context engineering realm, with things like facets and aggregates allowing you to give context to the language model so it can make more search queries in the future. And today's guest, Simon, is the CEO of turbopuffer, previously leading a pretty big engineering team at Shopify. And we're going to talk about some of the design choices that he's made, right? How do you think about billion-scale search, bringing some real case studies from the companies we mentioned before, and how we think about things like tuning recall and latency. And as always, if you have any questions, please ask them via the Slido link that we shared in the Zoom chat, and upvote the questions you want to see answered. And with that, Simon, take it away.

Simon Eskildsen [1:31]:
Awesome. Well, hello, everyone. So when I proposed this talk, I called it billion-scale vector search on object storage. Slightly revisited here now: turbopuffer is at trillion scale now, of not just vector search, but search in general, on object storage. Today, I want to talk a bit about the founding story of the company, like why did I decide to build the first version of this and then bring on my co-founder and the rest of the team later on to productionize it a couple of years ago. We'll talk about how people use it and why they've decided to switch over from other solutions and other storage architectures to this more novel storage architecture. We'll talk about the storage architecture that makes turbopuffer special (it's the first completely object-storage-native database that exists) and why this type of architecture is so suitable for this era of search needs. turbopuffer at a glance is a search engine on object storage. It can do semantic search, so classic vector similarity, and we can do this on the trillions of vectors that we have in production today. It can do full-text search, so it's not limited to just semantic similarity but can also do traditional BM25 search, and you can combine those two types to do hybrid search, which many of our customers today are doing. But turbopuffer cannot just do semantic search and full-text search; it can also do aggregations and group-bys for the facets that Jason and I were just talking about earlier. What makes turbopuffer really special is that it has a new storage architecture. In this storage architecture, all of the data is by default stored on S3 or GCS or Azure Blob Storage, one of those three, and then all of the caching is put in front of it. So it works a little bit like a JIT compiler: the more you query it, the further the data moves up the cache hierarchy and the faster it gets. That's literally the reason for the name: the pufferfish inflates, right?
All the way from deflated in object storage, where queries take maybe around 500 milliseconds, onto disk, where they take tens of milliseconds, and into RAM, where they can take less than 10 milliseconds. So that is the general architecture of turbopuffer that makes it special, and we'll go into that in a second. So who's using turbopuffer? We have lots of customers that we've worked with very, very closely to productionize and make successful in production. Cursor was the first customer of turbopuffer. We've been working with them since they were a small team in 2023, and we feel we've grown tremendously alongside them. Notion, Linear, and others also work with us. Superhuman recently went to production as well. We work with one of the top AI labs and many other customers, hopefully some of which you know and use. You don't always get logo rights right away, but this list will continue to grow. So I want to talk a little bit about why you build a new database, because as many of you are likely aware, there have been search engines since the '90s, right? You've had Sphinx, you've had Elasticsearch, you've had Lucene. There are lots of different search engines that are all great for particular types of workloads. I fundamentally believe that if you are building a new database and you're going to make it, then you need two key ingredients. The first one is that you need a new workload. You need a reason for people to adopt a new database in their stack, because it's a serious commitment to adopt a new database: it's a very sticky product, it's difficult to migrate, and the feature surface area varies a lot across vendors. So you need a new workload. And the new workload today is to connect LLMs to enormous amounts of data, right? And in some instances also users to a lot of data. But connecting LLMs to new data is really the newest workload, right?
Like, in general we talk about RAG, and that's literally what RAG is. One thing that's also very interesting is that this new workload in particular is very large, right? If you have a kilobyte of text, after chunking it easily turns into four vectors of, say, 1024 dimensions, which means that you now have 16 kB of vectors from 1 kB of text. So it's really inflating the size of the data. We call this storage amplification, or size amplification, in database parlance, and some people even use vectors with far more dimensions than that in some cases to get more recall, more precision. These are enormous. One company that I work with was spending $3,000 a month on their Postgres instance; they wanted to vectorize all of their unstructured text and put it into a vector index in another database, and it would have cost them $30,000 a month. That was prohibitive for their use case, and it has been prohibitive for a lot of other use cases around the world, because when you take on this workload, you want to earn a return on the product you ship without the economics getting in the way. The second thing is that you don't just need a new workload, you also need a new storage architecture, because if you just have a new workload, then you can tack it onto an existing database, right? The reason why this is not a great fit for something like Postgres and others is just the enormous amount of data that this workload puts on a traditional storage architecture, which we'll get into in a second. Some workloads, like vectors, need a new storage architecture to make them economical so that companies can earn a return on them. But some do not. For example, when geosearch really blew up with mobile in the early 2010s, that didn't need a new database, because coordinates are tiny, right?
And they fit fine with all the existing databases. But vectors in particular require a new type of index and a 10x cheaper storage architecture can really unlock new use cases.
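The storage-amplification arithmetic Simon quotes can be checked in a few lines. This is a quick sketch; the chunk count and the float32 width are assumptions consistent with the numbers he gives:

```python
# Rough sketch of the storage amplification Simon describes:
# 1 kB of text, chunked into 4 chunks, each embedded as a
# 1024-dimensional float32 vector (4 bytes per dimension).
TEXT_BYTES = 1024      # 1 kB of raw text
CHUNKS = 4             # chunks after splitting
DIMENSIONS = 1024      # embedding dimensions per chunk
BYTES_PER_DIM = 4      # float32

vector_bytes = CHUNKS * DIMENSIONS * BYTES_PER_DIM
amplification = vector_bytes / TEXT_BYTES

print(f"{vector_bytes // 1024} kB of vectors from {TEXT_BYTES // 1024} kB of text "
      f"({amplification:.0f}x amplification)")
# → 16 kB of vectors from 1 kB of text (16x amplification)
```

At higher dimensionalities (some models use 3072 or more dimensions) the amplification factor grows proportionally, which is exactly why the $3,000-to-$30,000 jump he describes happens.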

Simon Eskildsen [7:35]:
The storage architecture that we can have today, that we couldn't have 10 years ago, and that no other database is designed to take advantage of in the same way that turbopuffer was when it launched, builds on three features. The first is NVMe SSDs. These SSDs are incredibly fast. They are only about four or five times slower than accessing memory if you use them correctly and drive a lot of I/O throughput, but they're about 100x cheaper than DRAM. So if a database can really take advantage of NVMe SSDs instead of DRAM, then you can drive really good economics. The second thing that you needed for the new storage architecture that turbopuffer has is for S3 to be strongly consistent. What this means is that if you put an object on object storage, then you should be able to immediately read it back and know it's the same object, which, as you can probably guess, is a very nice primitive if you're building a database, because it avoids a whole separate coordination layer. And the third thing you need is for S3 to have compare-and-swap, which they only launched in December of last year. What that means is that you can read an object, mutate it, write it back, and be guaranteed that it wasn't modified in the interim. This allows you to build a database that is completely object-storage native and has the economics that we'll go into in a second. So the architecture that you end up with is fairly simple. You have a client; it connects to a binary. That binary is the database, and it accesses a cache. It has a tiered cache: into RAM when the pufferfish is fully inflated, into SSD, and then finally into object storage. That is the overall, very simple architecture of turbopuffer, and it's been that since day one. Let's talk about the economics, because this matters: I said that you need those two raw ingredients to build a successful database company, or for there to be room for a new database business.
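The compare-and-swap primitive Simon credits S3 with is what lets a writer commit without a separate coordination layer. Here is a minimal in-memory simulation of the idea, not turbopuffer's actual implementation: a toy object store with etag-conditional puts, mimicking S3's `If-Match` semantics.

```python
import uuid

class ObjectStore:
    """Toy in-memory stand-in for S3 with conditional (compare-and-swap) puts."""
    def __init__(self):
        self._objects = {}  # key -> (etag, body)

    def get(self, key):
        return self._objects.get(key)  # (etag, body) or None

    def put_if_match(self, key, body, expected_etag):
        """Write only if the object's etag still matches; mimics S3's If-Match.
        expected_etag=None means "only create if the key does not exist yet"."""
        current = self._objects.get(key)
        current_etag = current[0] if current else None
        if current_etag != expected_etag:
            return None  # lost the race: someone wrote in the interim
        new_etag = uuid.uuid4().hex
        self._objects[key] = (new_etag, body)
        return new_etag

store = ObjectStore()
etag = store.put_if_match("ns/COMMIT", b"wal=0", None)        # first commit
ok = store.put_if_match("ns/COMMIT", b"wal=1", etag)          # succeeds
stale = store.put_if_match("ns/COMMIT", b"wal=1-race", etag)  # stale etag: rejected
print(ok is not None, stale is None)  # → True True
```

A losing writer simply re-reads the latest commit object and retries, which is why strong consistency plus compare-and-swap is enough to serialize writers on plain object storage.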
RAM is expensive: it costs about $5 per gigabyte, give or take with some error bars, but around $5 per gigabyte is a pretty reasonable number. Databases don't run entirely on RAM unless they're purely a cache, because, of course, if you shut down the machine, the data is gone. So generally, you store the data on three SSDs, three copies, because even if one machine goes away, you still have two other copies. That costs about $0.60 per gigabyte of data stored. This is how most traditional databases like Postgres and others store data. But if you store the data just in S3, well, it's $0.02 per gigabyte, right? That's like 20 times cheaper than storing it on three SSDs yourself. And if you can then take the data that's in S3 and cache it with an SSD cache in front, you're still a lot better off. And maybe not all of your data is even active, so a percentage is always in S3, a percentage is on SSD, and a percentage is in RAM, and you're paying exactly for the performance characteristics that you need as the pufferfish inflates on the subsets of the data that you need the most. These are improvements that are an order of magnitude better on real production workloads. We see some of our customers migrate and cut their first bill by up to 95% compared to the last bill from their previous provider, simply because their workload is a really good fit for this architecture. We happen to think most workloads are a good fit. The thing that you need to do to build a database that takes advantage of this, and I alluded to this earlier, is to build an object-storage-first database that is also round-trip sensitive. What this means is that when you go to, for example, S3, the p99 on accessing an object is maybe 200 to 300 milliseconds, depending on which object storage you use. But in those 200 to 300 milliseconds, you can max out the network.
You can get an enormous amount of data back, but every single round trip takes around that long. This is very similar to how an NVMe SSD works; it's just a lot faster than 200 milliseconds. Every round trip is around 100 microseconds, but you can drive a lot of throughput in every single one of them if you parallelize across a lot of reads. In RAM, it doesn't matter as much: a random read in RAM is very, very fast (of course, sequential is faster). But if you build a round-trip-sensitive database, then it will work very well with S3, by going out, getting some data, then a little bit more data, and then serving the query, rather than doing lots of round trips that are very slow and don't get much data out of each one. It works really well on SSD as well. So if we talk about vector storage and building this object-storage-first, round-trip-sensitive database, then we have to be very careful about how we lay out the vectors for vector search. If you search over raw vectors and just do an exhaustive search, it becomes very slow, because if you have a billion vectors, you have to scan terabytes and terabytes of data sequentially, whether from object storage, from disk, or from RAM, which is very expensive. Even if all of it is in RAM, a million vectors will take you hundreds of milliseconds. You can maybe squeeze it down, but it becomes very difficult to do economically across a lot of queries, and it becomes extremely prohibitive for cold queries on object storage. A graph index is a very common way of doing vector search; it became very popular as the first wave of vector databases arose. But a graph index is also not suitable for an object-storage-first or round-trip-sensitive database that has to work well on disk and on object storage. The reason is that when you navigate a graph, you sort of get dropped in the middle of it, and then you navigate the graph by its edges.
Every single time you do that, it's 200 milliseconds: 200 milliseconds at the start, then 200 milliseconds to get to the first layer, 200 milliseconds to get to another layer, and so on as you navigate the graph. That's very good in memory, because you're not reading that much data and the latency is very low, but for disks and for object storage, where you're much more sensitive to the number of round trips versus the amount of data per round trip, it's not a good architecture. Then there are clustered indexes, which were actually the first wave of vector indexes, even before vector databases became popular a few years ago. What you do in a clustered index is try to find a natural grouping, right? You have a clothes cluster, a food cluster, whatever the semantic grouping is, and then you put those adjacent on disk. So if you think about it very simply, you can think of it as: on S3 there's a file called 1.txt, 2.txt, 3.txt, and 4.txt. Then, for every one of those clusters, we take the average of all of the vectors in the cluster, and we create another file called centroids.txt. Now we only have to do two round trips to serve the query. We get all of the centroids from centroids.txt, find the closest, say, two or three clusters, and then we download just those files, right? So imagine that at a much larger scale: we can max out the NIC to get all the centroids, then max it out again to get the clusters, but only do two round trips. So we can look at a lot more data with this kind of architecture, and it works very well on disk as well. And it happens to also work great in memory. Okay, there's some chat here; I'll answer those afterwards. Of course, every database comes with trade-offs, and we try to be very transparent about what the trade-offs are, because I spent the majority of my career on your side, looking at these databases all the time. And the first question is always: what are the trade-offs? What are the limits? What is it suited for?
What does it do, and what does it cost? The trade-off for an object-storage-first database is that cold queries can be slow. Once in a while, you'll hit a server that doesn't have the data in cache, and you have to go to object storage. No matter how much we optimize it, that's still going to be hundreds of milliseconds, maybe half a second, for that first query while we start hydrating the cache, inflating the pufferfish, for that particular set of data. You can mitigate that and still come out cheaper, because you have a very cheap canonical source of truth at $0.02 per gigabyte: you could keep everything in cache all the time, and it would still be much cheaper than any other storage architecture. But once in a while, you will have that cold query. The other limitation is high write latency, right? Every time you write to turbopuffer, you write directly to object storage, and you cannot beat that: it's going to be 100 to 200 milliseconds of latency. And there are some economics around doing very small writes that are a little bit unfavorable to some workloads, but it's not something we see very much. Those are the fundamental limitations. They trickle into some other limitations, like certain types of transactions being difficult, but really these are the only fundamental limitations of this architecture. It means that a very high-volume transactional workload, like at Shopify, where I worked before, would not be suitable: you would not build a checkout system on a database like this. But it is extremely suitable for indexing lots of data to search, or letting an LLM search, because it's low cost and it's very simple, right? We're not building the storage layer; the hundreds or thousands of people, or however many work on S3, are building that for us, and the same goes for GCS on the GCP side.
These are some of the most reliable, horizontally scalable, and durable systems on the planet, so we can focus on the indexing and the database itself. Warm queries can be just as fast as an in-memory database once the data is in cache, and we can get extremely high write throughput and give this serverless experience where people don't have to think about node types, how many servers they're on, things like that, because that's really an S3 problem that was solved a very, very long time ago. And this gives us this advantage where we see write peaks of 10 million plus vectors written per second, and it works great because we can scale as far as S3 can scale, and we haven't found the limits yet. So the architecture that you end up building around this is really just as simple as you can imagine. Every time a query comes in, it gets routed with a consistent hash to the query node that is responsible for that subset of the data. Then, if the query is cold, we go to object storage and take maybe half a second, a couple of round trips, to get the data and serve it directly to the user. It might be noticeable, but it's not going to be so slow that you're sitting in front of the search box for a long time; it'll probably still be faster than a lot of searches you find on a lot of websites. As the cache hydrates, very quickly at a gigabyte or more per second, the queries get much, much faster. For some of our customers, like Notion, when you open the Q&A dialogue to work with your data, they will send a request to turbopuffer to start hydrating the cache for that particular namespace so that the subsequent queries are fast. Lots of companies have this kind of pattern where they have a hint that you're about to access the data, and then the cold latency goes down even further.
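The clustered layout Simon walked through earlier, a centroids file plus one file per cluster fetched in two round trips, can be sketched generically. This is an IVF-style toy with made-up 2-D vectors and two tiny clusters, not turbopuffer's actual on-disk format:

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Pretend object store: one "file" of centroids, one "file" per cluster.
clusters = {
    0: [(0.0, 0.1), (0.1, 0.0)],  # e.g. the "clothes" cluster
    1: [(5.0, 5.1), (5.2, 4.9)],  # e.g. the "food" cluster
}
# centroids.txt: the per-cluster average vector.
centroids = {cid: tuple(sum(dim) / len(vs) for dim in zip(*vs))
             for cid, vs in clusters.items()}

def search(query, nprobe=1, k=1):
    # Round trip 1: fetch centroids.txt, pick the nprobe closest clusters.
    nearest = sorted(centroids, key=lambda cid: l2(query, centroids[cid]))[:nprobe]
    # Round trip 2: fetch only those cluster files and scan their vectors.
    candidates = [v for cid in nearest for v in clusters[cid]]
    return sorted(candidates, key=lambda v: l2(query, v))[:k]

print(search((5.0, 5.0)))  # → [(5.0, 5.1)]
```

The round-trip count stays at two regardless of how many clusters exist, and each trip can saturate the network, which is exactly the property that makes this layout work on S3 and NVMe where a graph traversal does not.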
On a write, you go to a query node as well, the same query node, and we write the data directly into the cache to increase the probability that the new data is also in cache, so that subsequent queries are faster. One of the things that's unusual about turbopuffer's design is that all reads are strongly consistent. Once you've made a write to the database, it's immediately visible to the next query, with strong consistency guarantees, which are also the guarantees S3 itself operates with. We think this makes systems more predictable and easier to reason about. You can turn this off for better performance, but that's the default. When you write, we write into the namespace directory on S3. You can think about a namespace as essentially a prefix on S3, or a table, or something along those lines, and we write into the write-ahead log. The write-ahead log is essentially just, you know, 0.txt, 1.txt, 2.txt for every write that you're doing, and we do a bunch of batching to save on costs. Once we've written enough data to update the indexes, we will, in the background, use indexers to compact and update the full-text index, the vector index, the attribute indexes, and all the various indexes that we build for the data, to keep it fresh. The query nodes then pick up the new indexes as they get built and page those new keys into cache as they get queried. The performance of this ends up being really nice, right? Even when the pufferfish is fully deflated, we still get really good cold latency. The P50 can get as low as roughly 200 milliseconds for full-text search workloads, and even the P99 is around 500 or 600 milliseconds. This depends a lot on S3 and all of its caching and things like that, but we do a lot of things in the background, hedging requests and so on, to keep this latency down as much as possible. When it's in cache, it's as fast as many other systems.
The majority of the time and the variability here is really that for every single query, we go to S3 and make sure that we have the latest data, right? Again, turbopuffer doesn't have any other metadata store or anything like that, so we have to go to S3 or GCS to get the latest commit to make sure we serve consistent queries. If you turn this off, all this latency goes down, and I think even the P99 is probably less than 10 milliseconds if you're okay with eventual consistency. We can talk more about this if anyone's interested. Let's cover some of the case studies here, and then we'll turn it over to Q&A in about five minutes. So Cursor is one of our use cases, and I'll talk a little bit about how they use turbopuffer. With their previous solution, they were using an in-memory vector database, which is really the first generation. Always-in-memory makes a lot of sense for something like Shopify, where the entire catalog is being queried all the time: you might as well have it in memory. The economics of turbopuffer are still a lot better, but you might still be able to earn a return on a more traditional, in-memory storage architecture. But for something like Cursor, not every code base is active all the time, right? At any point in time, some percentage, I don't know, 1%, 10%, I don't know what the real number is, but some percentage of the code bases are active. Those can be in memory or on SSD, and the rest can sit in S3 or GCS or wherever that code base is stored. So every single code base in Cursor is just a prefix in S3, right? There can be tens of millions of these at any point in time, before they get GC'd out.
So this pufferfish architecture really lends itself very, very well to a Cursor code base: as soon as you open a code base, we can start hydrating the cache for the namespace, and then all of the RAG that Cursor does becomes faster. If you use Cursor today, with the agents and things like that, you will see that it often does semantic queries like, "Hey, where in the code base does this happen?" And that's using turbopuffer behind the scenes, which Cursor keeps up to date and which uses embeddings and re-ranking and so on to draw the right context in. So this helps Cursor both optimize their inference, by finding the relevant context and as little of it as possible, and find things that can be very difficult to find with grep, or even by letting the agent grep around. In my experience using Cursor's agent, it's very good at these kinds of tasks where an agent might otherwise need a lot of attempts with grep, because it can find the answer right away with the semantic index. Cursor has their own embedding model, and they operate at very large scale. When they moved to us and to our storage architecture from their previous provider, they cut their cost by 95%. One thing that excites me more than cutting costs for our customers is letting them realize the most ambitious version of their product. In Cursor's case, this allowed them to index much, much larger repositories than was economical for them before. The other thing was that on the traditional storage architecture, they had to be very careful about which servers had which code bases and do all this bin packing. With turbopuffer, they don't have to do that, because it's horizontally scalable to as many namespaces as you want. Notion is another customer of turbopuffer, and they saved millions and millions of dollars when they moved to our storage architecture. It's also very suitable for Notion, right? You have a lot of workspaces, some subset of them are active at once, and now you can realize those economics.
They have more than 10 billion vectors, they do really large write peaks, and they have millions of namespaces for all of their data. One of the things that I really liked was that once Notion moved to us, they removed all the per-user AI charges, and they've been really, really good partners to us. The last use case I'll show here is Linear, another one of our customers. They were dealing with Elasticsearch and pgvector before, and they wanted something really hands-free where they could just pump in all the data: they didn't have to worry about it, they didn't have to think about machine types, and they didn't need anyone to operate it and be on call for it. They got the cost reduction, but that just made them more excited to connect even more data to the LLMs. They really think about us as this foundational search layer. In the same way that 10 years ago we expected all of our SaaS to ship a mobile app, today we expect every SaaS to have semantic search. We expect them to have some kind of generative research mode. We're going to expect a baseline of AI features as these SaaS platforms evolve, and Linear really thought of us as the foundational search engine for all of that. With that, exactly on the 30-minute mark, I will hand it over to you, Jason.
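The write path Simon described earlier, numbered write-ahead-log objects in a namespace prefix that background indexers later compact into indexes, can be sketched as a toy model. The file names and merge logic here are illustrative only, not turbopuffer's actual format:

```python
class Namespace:
    """Toy model of the per-namespace layout described in the talk:
    numbered WAL entries under one prefix, compacted in the background."""
    def __init__(self):
        self.objects = {}  # simulated "S3 prefix": key -> bytes
        self.next_wal = 0

    def write(self, batch: bytes):
        # Each (batched) write lands directly in the WAL as 0.txt, 1.txt, ...
        self.objects[f"wal/{self.next_wal}.txt"] = batch
        self.next_wal += 1

    def compact(self):
        # Background indexer: fold all WAL entries into a single index
        # object, then drop the compacted WAL files.
        wal_keys = sorted(k for k in self.objects if k.startswith("wal/"))
        merged = b"".join(self.objects.pop(k) for k in wal_keys)
        existing = self.objects.get("index/segment-0", b"")
        self.objects["index/segment-0"] = existing + merged

ns = Namespace()
ns.write(b"doc1;")
ns.write(b"doc2;")
ns.compact()
print(sorted(ns.objects))  # → ['index/segment-0']
```

In the real system the "index" side is a set of full-text, vector, and attribute indexes rather than one blob, and query nodes page the freshly compacted keys into their cache as they are queried.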

Jason Liu [30:00]:
Oh, look at that. I mean, you're free to continue with any other slides; I saw some pretty interesting slides in the appendix. But let's jump into some questions.

Jason Liu [25:38]:
Yeah, I definitely see the case for these cost optimizations. Maybe not in 2025, but I remember in 2020, I was talking to some companies that were using what you might call legacy vector search systems, and it was the case that they could not go GA until they moved to turbopuffer. Those kinds of stories are what really caught my attention in those early days. Just looking at some of these questions, I feel like a lot of them are actually comparisons against other search systems. People are really curious about tools like Elasticsearch and Qdrant. Generally, when I think about things like turbopuffer, I think about the fact that Notion, Linear, and Cursor all have really well-defined partitions: there's a workspace, there's a repository. But what would it look like for turbopuffer to power something like a Twitter or e-commerce, where there are just no natural partitions, or maybe there are? I'd love to hear your thoughts on that.

Simon Eskildsen [26:23]:
Yeah, so the partitioning was really a go-to-market move, right? Every startup's weapon is focus. And our focus was that the only thing that scales, even now, is sharding. I learned that at Shopify; we shard on shop for everything. So we felt that a really good way into the market was: let's give people unlimited sharding and a really good experience with lots and lots of shards, right? That was Cursor and Notion. We used to not be very good at large shards. If you were at 10 million plus, we would actively tell people, "No, go use something that's been around for a while. Go use Pinecone, go use Qdrant, something like that. That is not our ICP." Now we're very good at that. We have customers running in production with namespaces in the hundreds of millions. We're working with customers on basically building Google, right? Like, they want to search 100 billion documents all at once. And that's what we're working on. We have customers that are searching 1 billion plus documents, and we are getting very, very good at this. The trick here, and this is the same in Elasticsearch or any system at scale that does some kind of sharding to use multiple machines, is that you want the shards to be as large as possible. A small shard is, like, what we targeted when I ran Elasticsearch at Shopify: around 30 to 50 gigabytes. So if you have a data set in the hundreds of terabytes, which billions of products easily is, you have to pay, you know, M times log N, or whatever the complexity of your search is. That's not actually the complexity of an inverted index, but let's say it's log N per shard. Well, if M is very high, you're spending a lot more computational resources than if you make M smaller and N higher. So you want the largest shard sizes you can get, and that's what we're working on right now: very, very large individual shard sizes.
So in order to do 100 billion, you have to run many indexes, but we will continue to make them as large as we possibly can. In terms of implementations for very high throughput and in-memory, Qdrant's seems like something that the customers of theirs I've spoken to have had really good success with. Where it starts to get difficult to maintain an HNSW at that scale is when you have a lot of churn in the data and you're doing very high write throughput. So we'll get there; it's not a fundamental compromise in our architecture. We will get good at it, I think we're going to get exceptional at it, and the results of the POCs right now are quite good. To compare us to Elasticsearch, it's really about that traditional storage architecture, right? I've been on call for Elasticsearch. It's probably the worst database I've operated in my life, and part of this company is my vendetta against it. So I do have some bias, and it's probably gotten a lot better since I worked with it almost 10 years ago. But it has a more traditional storage architecture, right, of two or three SSD copies. You run those at about 50% utilization, and the disks are smaller, so it can be difficult to realize the economics you need. At the end of the day, an infrastructure decision is a set of trade-offs plus economics you can earn a return on. If the per-user cost of your Elasticsearch cluster, or whatever cluster you're using, is $10 per user and you're charging them $20, well, that's not a good return, right? And so what we want for you is to increase the ambition of your product, index more data, and have a better return, so you can get to the gross margin that you need to get to.
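Simon's preference for large shards falls out of the rough cost model he mentions: total work is about M times log N for M shards of N documents each. Holding total documents fixed, fewer, larger shards do less total work. A quick sketch of that arithmetic, using his caveat that log N is a stand-in rather than the true inverted-index complexity:

```python
import math

TOTAL_DOCS = 100_000_000_000  # the "100 billion documents" scale Simon mentions

def search_cost(shard_docs):
    """Rough M * log2(N) model: M shards, each searched in ~log N work.
    Not the true complexity of an inverted index, per Simon's caveat."""
    m = TOTAL_DOCS / shard_docs  # shard count for a fixed total corpus
    return m * math.log2(shard_docs)

small = search_cost(10_000_000)      # many small shards (10M docs each)
large = search_cost(1_000_000_000)   # fewer, larger shards (1B docs each)
print(f"small shards cost / large shards cost = {small / large:.1f}x")
# → small shards cost / large shards cost = 77.8x
```

The log factor barely grows as shards get bigger, while the shard count shrinks linearly, which is why pushing individual shard sizes up dominates the economics at the 100-billion-document scale.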

Jason Liu [29:57]:
Yeah, great answer. I definitely like that a lot. I guess one question that came from Adam actually is around cache controls. The question is, what kind of cache controls do you hand over to the user, and how much tuning typically goes into achieving these costs and performance goals?

Simon Eskildsen [30:12]:
So by default, we don't really want you to think about it. For example, if you use S3 or GCS, you can turn on automatic storage tiering. What that means is basically: if you don't access the data for a while, it moves down to a lower storage class, and when you access it again, it moves back up. That's how we think about turbopuffer as well. We want to default to really good cache behavior. We don't want you to have to think about it; we don't want you to have to configure a namespace to be cold or warm. You don't have that many controls, because generally our users don't need them. There are very specific edge cases, like the case I was talking about of searching 100 billion web documents. If you're doing that, then, yeah, we're going to work with you a little bit on the caching until we know the heuristics to get it right at that scale, because it's a difficult bin-packing and caching problem. But for most other workloads, the default behavior is phenomenal. The main control you have is that you can send a hint_cache_warm request to turbopuffer. If the namespace is not in cache, we will charge you one query and start hydrating the cache; if it's already in cache, it's free. That's the main trigger that you have today, and it works great.
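The cache-warming pattern Simon describes (fire the hint when the user opens a workspace, as Notion does with its Q&A dialog, so that later queries hit warm cache) might look something like this. Only the hint_cache_warm name comes from the talk; the host, URL path, namespace name, and auth header here are all hypothetical, so check the vendor's API reference for the real shape:

```python
from urllib.request import Request

# Hypothetical sketch of a cache-warming hint. Everything about this
# request except the hint_cache_warm concept is an assumption:
req = Request(
    "https://api.example.com/v1/namespaces/workspace-123/hint_cache_warm",
    method="POST",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)

# Actually sending it (e.g. with urllib.request.urlopen) is omitted here.
# In practice you'd fire this on a signal that data is about to be queried,
# such as the user opening a workspace or repository.
print(req.get_method(), req.full_url)
```

Because an already-warm namespace makes the hint free, the client can send it unconditionally on every workspace open without worrying about cost.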

Jason Liu [31:20]:
Oh, I wasn't aware of this one, but one of the questions is, will you remove the minimum spend requirements at some point for these smaller use cases?

Simon Eskildsen [31:31]:
Yeah, it's a good question. I think that if you have a very small use case and you already have a Postgres database, you could probably get away with pgvector. As we mature the product, so that we're less of a single knife and more of a drawer of tools to help you do your search, then I think it makes sense to do this. The main reason we have the minimum spend requirement is that we really want to give people a really, really good experience. We take it as an extremely serious commitment that you are trusting us with your uptime, and we're scaling a support team and an on-call pager and all of that to be extremely responsive if there's any issue, even if it's not caused by us. That's why we have that minimum spend requirement. It is not an infrastructure minimum, nothing like that. It's really just to guarantee a good experience, and we expect to lower that minimum over time. And it's not that we won't ever have a free tier. It's just not the right choice for us right now to have a free tier and support it with the high-quality support that we've come to pride ourselves on.

Jason Liu [32:37]:
That makes a lot of sense. I guess this is the question I'm also curious about because I noticed this in Cursor sometimes, which is that, you know, maybe for Notion documents and for Linear tickets, there's not much editing of these data objects, but you can imagine in Cursor, if you change a file very quickly and make another query, how do you think about refreshing the index and how should we think about designing such a system that maybe is like a little more write optimized?

Simon Eskildsen [33:02]:
Yeah, so first off, turbopuffer is very write optimized, partially because Cursor does a lot of writes and Notion does a lot of writes. There are a couple of angles to talk about this question from. The first one is a realization that many of our customers have made, and the first time it was explained to me, I thought, "Oh, yeah, this makes a lot of sense." If you're doing full-text search, you almost want to re-index on every keystroke, because the exact string changes on every keystroke. That matters for full-text; the semantic meaning of a chunk doesn't change on every single keystroke in the same way, especially if you're using hybrid search. And creating the embeddings is often much more expensive than storing them in turbopuffer. So we see customers find some compromise that makes sense to them. They do the vibe check of how many characters, how many bits, what edit distance before they have to re-embed, whether the semantic meaning is still captured well enough, and then debounce by some time: after some amount of time, you always make the change. That's very common, and I think it's a very interesting observation. The second thing is just that there is an economic piece to it, right? Do you want to do this all the time, or every minute, or how is your pipeline set up? It costs money to keep these ANN indexes up to date, so you have to make some reasonable compromise here that you can earn a return on. We're very sympathetic to users that want to do a lot of writes, and the system is heavily optimized for it. But it's a choice made by the users, and it's not something we really put any particular constraints on. Lots of our customers do it close to every keystroke, and it's generally driven by the economics of creating the embeddings.
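The debounce-plus-drift heuristic Simon describes might be sketched like this. The thresholds, the use of difflib's ratio as a cheap stand-in for normalized edit distance, and the class name are all illustrative, not anything turbopuffer ships:

```python
import time
from difflib import SequenceMatcher

class ReembedPolicy:
    """Re-embed a chunk when its text has drifted far enough from the last
    embedded version, or when a debounce window has elapsed anyway."""

    def __init__(self, drift_threshold=0.10, debounce_seconds=30.0,
                 clock=time.monotonic):
        self.drift_threshold = drift_threshold  # fraction of text changed
        self.debounce_seconds = debounce_seconds
        self.clock = clock
        self.last_text = None
        self.last_time = None

    def should_reembed(self, text):
        now = self.clock()
        if self.last_text is None:
            return self._accept(text, now)  # first version always embeds
        drift = 1.0 - SequenceMatcher(None, self.last_text, text).ratio()
        if drift >= self.drift_threshold:
            return self._accept(text, now)  # meaning may have moved
        if now - self.last_time >= self.debounce_seconds:
            return self._accept(text, now)  # debounce elapsed, re-embed anyway
        return False

    def _accept(self, text, now):
        self.last_text = text
        self.last_time = now
        return True
```

The full-text index, by contrast, would be updated on every write; only the embedding step is gated this way.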

Jason Liu [34:51]:
Next, on the topic of embeddings, this is a question that a couple of folks upvoted, which is, is turbopuffer optimized for late interaction models like ColBERT, where you're storing multi-vectors per chunk?

Simon Eskildsen [35:06]:
So it is and it isn't. It is in the sense that one of the largest challenges with late interaction models is, and apparently this is my pet peeve today, that they're very hard to earn a return on, because it's an enormous amount of data. It is the best, or some of the best, precision you can possibly get, and you can do it with turbopuffer. I can share a gist, or send you, Jason, a kind of pastebin, of how to do it with turbopuffer. Fundamentally, all it is is that for every token you send a top-k query of, let's say, a thousand. Say you have 10 tokens in a query: you run all of those as a multi-query to turbopuffer, you get the candidates back, and then you do a second layer of queries to rescore. You don't have to fetch all the vectors, and you end up emulating the late interaction result. The question is whether you can live with the economics of it, right? You can squeeze the embeddings small enough, go to F16 and all of that, and we will continue to optimize the economics. I've had that pastebin around for six months, and I haven't seen anyone put it in prod yet. I think someone here should put it in prod, and we would love to work with you, because we have the multi-query pattern; it's totally possible to do in turbopuffer. As for what you need to realize late interaction performantly, and part of the reason it's not widely adopted, the math really just comes down to economics: how much are you willing to pay for that extra 10 to 20% of precision that late interaction gets you over late chunking and the other techniques you could otherwise use? But if you have a small amount of data and you just care about precision, I think you should try it, and I think turbopuffer would probably by far be the most economical way of doing it. But no, we don't have a turnkey "here's late interaction" API. You have to string it together yourself, and it's very simple to string together.
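The two-stage emulation Simon describes can be sketched roughly like this, with `search_top_k` standing in for one per-token leg of a multi-query and an in-memory rescore; the function names and data shapes are assumptions, not turbopuffer's API:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def late_interaction_search(query_token_vecs, search_top_k, doc_token_vecs,
                            k=1000, top_n=10):
    """Stage 1: one top-k ANN query per query token gathers candidate docs.
    Stage 2: rescore candidates with MaxSim, i.e. sum over query tokens of
    the max similarity against any token vector of the document."""
    candidates = set()
    for qv in query_token_vecs:
        candidates.update(search_top_k(qv, k))  # one "query" per token

    scored = []
    for doc_id in candidates:
        score = sum(
            max(dot(qv, dv) for dv in doc_token_vecs[doc_id])
            for qv in query_token_vecs
        )
        scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]
```

The economics point lands here too: every query fans out into one ANN query per token, and every document stores one vector per token, which is why squeezing the per-vector size matters so much.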

Jason Liu [36:57]:
On the economics of embedding models, one of the questions here is around quantization. Does turbopuffer do its own quantization internally? How do you think about optimizations on that front? Or is it the developer who's responsible for quantizing these vectors before writing them to turbopuffer?

Simon Eskildsen [37:12]:
Yeah, so we do quantization and clustering and all kinds of things to optimize the performance of your index. If you send us smaller vectors, we'll index the smaller vectors. At this point we don't support integer or binary vectors, but we will. The smaller you can send them, the better the economics and the performance are going to be, because if you send us an F32 1024-dimensional vector, I have to store it at full precision, since you want to be able to export it later at full precision, and I have to charge you that way. If you can pass in something already quantized, then we don't have to store it at full precision. So it becomes a bit of an API problem. The other thing is just that getting all the SIMD instructions right for all the different architectures and all the different quantizations takes a bit of time. We find that most of our customers are very happy with the economics of F16, but we'll continue to quantize smaller and smaller as our users request it. Fundamentally, the way I look at it, and I've talked to big embedding providers about this and it's roughly right, is that what matters is how many bits of information you're passing. One lever is how many bits you have per dimension, and the other lever is how many dimensions you have. So whether you have an F16 vector with 100 dimensions or a binary vector with many more dimensions but one bit per dimension, it's the same number of bits that are there to compress and project a learned representation into.
So you should fundamentally be able to get away with a much smaller F16 vector if the embedding model is optimized enough, and that's what we've seen from some of the sophisticated users. They just train in F16 because it's extremely optimized in the CUDA pipelines and performs really well, and then they do the truncation and normalization with Matryoshka learning to get a really good representation. But yeah, that's probably a way longer answer than they wanted.
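The two levers above, bits per dimension and number of dimensions, can be sketched with the standard library: a half-precision round-trip via struct's 'e' format, and a Matryoshka-style truncate-and-renormalize. Function names are illustrative:

```python
import math
import struct

def to_f16(vec):
    """Round-trip a float vector through IEEE 754 half precision,
    the bits-per-dimension lever: 16 bits stored instead of 32."""
    packed = struct.pack(f"{len(vec)}e", *vec)
    return list(struct.unpack(f"{len(vec)}e", packed))

def truncate_and_normalize(vec, dims):
    """Matryoshka-style truncation, the number-of-dimensions lever:
    keep the first `dims` dimensions, then re-normalize to unit length
    so cosine/dot comparisons stay meaningful."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

A 1024-dim F32 vector is 4096 bytes; truncating to 512 dims and storing F16 brings it to 1024 bytes for the same "bit budget" trade Simon describes.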

Jason Liu [39:23]:
That was great. I feel like this is the first time, again, I've heard this depth of technical talk. This next question is pretty good. So, you know, it might be a whole session on its own. There are a lot of questions around turbopuffer's facets and aggregates and how they're actually useful in practice, right? Maybe in the legal context. Given that you are from Shopify, could you talk a little bit more about like how maybe facets could be used in the traditional context and then we can maybe extend into the agentic context?

Simon Eskildsen [39:55]:
Yeah, I think you should talk a bit about the agentic context, because I know that's something you're very excited about, and we share the e-commerce path, so I'll talk about it there. I don't know if this is still the case, but I was actually part of the team that shipped facets at Shopify, and we did it on MySQL, because not all collections in Shopify were powered by search; some were powered by MySQL collections, and the search facets were different. There it's just a big select-count group-by union; that's really what it is. So how are facets used? Well, in an e-commerce context, it's: I'm doing this search and I want the different colors, like blue, orange, whatever. It's really a select count from table, group by whatever you want to facet on. You might facet on color, you might facet on size, and then there are some really weird ones like faceting on price, because with price you want to roll it up into some distribution that makes sense, and the cardinality gets very high. Fundamentally, it's about the fact that when you go to an e-commerce site and do a search, you want relevant filters to pop up on that left-hand side. I'm sure all of you have seen really bad faceting where they have every single price listed on the left side, and every size, and there's red, burgundy, maroon, and it's just: no, I don't care, I want this normalized. The form of faceting that's in turbopuffer today is very simple, and it doesn't support every single one of these cases. Faceting on a vector search is kind of funky, because a vector query has a result for every single entry in the data set, so you need to apply a threshold and then facet on that.
So you might want to facet on the full-text side of the hybrid search instead, so it all gets a little bit funky, but we're expanding it with feedback. If any of you are turbopuffer customers, or about to be, you should drop that kind of feedback in the community channel. But Jason, I think you should speak about it in the agentic context.
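The "select count ... group by" shape of faceting, including the price roll-up Simon mentions, in a minimal in-memory sketch; the bucket edges and function names are illustrative:

```python
from collections import Counter

def facet_counts(rows, facet_fields):
    """One Counter per facet field: the in-memory equivalent of
    SELECT field, COUNT(*) FROM results GROUP BY field."""
    return {
        field: Counter(row[field] for row in rows if field in row)
        for field in facet_fields
    }

def bucket_price(price, edges=(25, 50, 100)):
    """Roll high-cardinality prices into range buckets, rather than
    listing every distinct price in the sidebar."""
    for edge in edges:
        if price < edge:
            return f"<${edge}"
    return f">=${edges[-1]}"
```

Run over a search result set, `facet_counts` yields exactly the "blue: 12, orange: 3" style sidebar counts described above.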

Jason Liu [41:59]:
Yeah, totally. So one of the things we think about in terms of the difference between traditional vector search and this context engineering is that we're trying to give the agent clues in context so it makes better use of tools in the future. You can imagine a simple file search might just be a text input, and then I return 10 chunks. But what if I can also go back and say: of the chunks that would have been returned, actually 45% come from this file or this directory? If we include that in the context, and we tell the agent that on subsequent queries we can return full pages from a document, or even whole documents, say whole pages from a PDF, it might be able to load better data in the future. A simple example in coding agents could be something like using grep or find. When you run the find command, really what you're doing is counting how many files have how many occurrences of a certain keyword. And then the next tool you call is a read-file tool, because you recognize that the find query surfaced a file that is really relevant. This works because we've designed a portfolio of tools with a range of parameters that you can use to filter against and do all these other interactions. So by having more information in the search result in the form of facets, we can use those facets to make better searches in the future. It's the same thing in e-commerce. Maybe I'm searching for shoes, so my first query is just "shoes". Then maybe I get two facets: a brand facet and a star-rating facet. And I realize, oh, what I actually want is five-star shoes from Nike under the $50 price point.
I can click those three facets and make a new search query: find me running shoes, filtered by Nike, filtered by stars, filtered by price. You could imagine a sophisticated agent could have done this in one step, but the truth is, we didn't have that context. I did not know there were so many Nike shoes, I did not know I had the ability to filter on stars, and I did not know what a reasonable price filter was. By providing that in context, you can whittle down your search. And this really comes down to the idea that even with worse search tools, because these agents are so persistent, as long as you give them more context, and the context can be used to call the tools in a better way, you're going to get better performance. I think that's one of the things many vector databases lack: the ability to provide additional context around data that wasn't returned. Elasticsearch provides this, but not many vector databases do. That's a pretty long-winded answer, but that's why I'm excited about things like this. I'm curious if you have any thoughts on that.
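That facet-driven follow-up query might be sketched like this; the filter tuple format and function name are illustrative, not any particular client's API:

```python
def refine_query(base_text, selected_facets):
    """Turn facets selected from a previous result (by a user or an agent)
    into filters on the next search call."""
    filters = [(field, "eq", value) for field, value in selected_facets.items()]
    return {"text": base_text, "filters": filters}

# The "shoes" example from the discussion: facets seen in round one
# become filters in round two.
followup = refine_query("running shoes", {"brand": "Nike", "stars": 5})
```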

Simon Eskildsen [44:56]:
No, I mean, I think it makes sense. I think I had this in one of the slides: every successful database ends up implementing every query. One of the unfortunate things about the way the query language has evolved in Elasticsearch is that it's very, very difficult to use, and it's very difficult to understand what's important. That's one of the strengths we have: working closely with our customers and getting feedback from people like you, Jason, we can have opinions about what we should prioritize for the context. And I like that framing: I just want a summary of what's important about everything that wasn't in the top k, something that you, as the database person, know can be computed within a reasonable latency budget. I think that's great and really interesting feedback.

Simon Eskildsen [46:34]:
I think in legal, I mean, I don't know anything about legal other than the contracts I have to read. But I could imagine that if I'm searching in my contracts, I want to know by customer, and then be able to filter to that, because I have a custom DPA, a custom MSA, maybe some amendments to the agreement, and so on. I might want to search for which lawyer worked with me on this, who was involved, who is the signatory. All of these are essentially follow-up filters that both a human and an agent would find really interesting to discover in the data. At the end of the day, what something like turbopuffer really needs to do is allow humans and agents to converse with the data, so we need to provide all of the queries that can do that. Faceting is part of that, because there might be hundreds or thousands of different colors in a collection, and we can surface some aggregate that makes sense, because it doesn't make sense to load a billion records into a context window; the attention computation required to do that is just not possible to earn a return on. So what does that database look like? It's going to have some kind of fuzzy search, but it's also going to have other ways of massaging the data, so you can pivot it around and explore interesting things about it.

Jason Liu [47:06]:
I mean, to answer the question in the legal domain, you can imagine an example where maybe I am searching for some clause in a legal context, and so I return 10 text chunks that have the clause. But maybe what I actually recover is the fact that I have some facets on the file name, and I realize, you know, 90% of the clauses came from three files. The agent might then say, you know what, I'm not going to do my semantic search query next; I'm going to just load the whole damn PDF, right? Because the PDF says this PDF ID has been referenced 10 times by the semantic search. Just read the whole damn file, right? And it's a separate read-file tool versus a search-chunks tool. And that could be another simple example of that.
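The read-the-whole-file heuristic Jason describes might look like this as a tool-routing sketch; the threshold and tool names are illustrative:

```python
from collections import Counter

def choose_next_tool(chunk_hits, whole_file_share=0.5):
    """If a large share of returned chunks come from a few files, switch
    from the search-chunks tool to a read-file tool on those files;
    otherwise keep chunk-level search."""
    by_file = Counter(hit["file"] for hit in chunk_hits)
    total = sum(by_file.values())
    dominant = [f for f, n in by_file.items() if n / total >= whole_file_share]
    if dominant:
        return ("read_file", dominant)
    return ("search_chunks", [])
```

In the legal example above, 90% of hits from one PDF would route the agent to read that file whole instead of issuing another semantic query.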

Jason Liu [47:55]:
That's great. That's great. All right, we have four minutes left. The question I generally ask everyone at the end is: when people are building and reasoning about these kinds of systems, what's something you think they are not thinking about these days that they should be?

Simon Eskildsen [48:13]:
I think they're not thinking about the latency of the embedding model. There are a lot of embedding models out there, and I'll pull a very concrete example because it's one of the most widely used: the OpenAI text-embedding models are great, but the problem is that they have very high P50 latency. The P50 latency you get is 300 milliseconds. Well, it doesn't matter that the turbopuffer latency is 8 milliseconds when it takes 300 milliseconds to create the query vector. We've seen customers create hundreds of millions or billions of embeddings, and it's not just OpenAI; there are lots of providers where the latency is very high. That's fine for agentic and Q&A workloads, but then they say, "Okay, it's time to do real-time search," and it's too slow. And they've already created all these embeddings. Again, people right now are not limited by the turbopuffer cost; they're limited by the embedding cost, and by the cost of re-embedding everything to switch models. That sucks, so don't fall into that trap. If real-time search is a use case you care about, make sure you measure that embedding latency. We've seen great latency from models like Cohere and Gemini, I think Voyage has also gotten very good at latency, and Together and Jina have had pretty good latency too, but some of the models really just don't have good P50 latency. So make sure you test it yourself; it matters not to just take blanket advice, because the latency from your region to whatever region they're in also matters, since that's the region where they get the cheapest GPUs so they can earn a return, again. That's what it all comes down to. Do that testing for yourself; it's very easy to do. You can ask Cursor or whatever coding agent you're using, and it can do this test for you in five minutes, but you've got to do it so you don't paint yourself into a corner.
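A quick harness for the latency test Simon suggests; swap the stub for your real provider's embed call. The percentile math is a simple sorted-sample approximation, and the numbers here measure only the callable you pass in:

```python
import time
import statistics

def measure_latency(embed_fn, texts, warmup=2):
    """Time an embedding callable (string -> vector) per call and report
    P50/P95 in milliseconds."""
    for t in texts[:warmup]:
        embed_fn(t)  # warm up connections and any server-side caches
    samples = []
    for t in texts:
        start = time.perf_counter()
        embed_fn(t)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {"p50_ms": statistics.median(samples), "p95_ms": samples[p95_index]}

# Illustrative stub; in a real test this would be your embedding provider's SDK.
stats = measure_latency(lambda text: [0.0] * 8, ["query text"] * 20)
```

Run this from the region your search actually serves from, since provider-to-you network latency is part of the number that matters.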

Jason Liu [50:07]:
That's actually a great answer. I should just go and make a script that people can run to benchmark everything against.

Simon Eskildsen [50:14]:
You should. You should do it in your course, because legally you can't do it in public; the ToS prevent you from doing that at scale.

Jason Liu [50:15]:
I'll put that on the to-do list.

Jason Liu [50:21]:
Perfect. Yeah, we'll do that as the next action item. With that said, you know, if there's any other message you want to have for the audience, feel free to share that now. And then I'm going to follow up with you afterwards, get some links, and we'll distribute them in the reader notes.

Simon Eskildsen [50:36]:
Yeah, I mean, no, not really. Come try turbopuffer if any of this resonated.

Jason Liu [50:42]:
Everyone else, thank you for tuning in. We'll send the recording and the notes afterwards. And again, Simon, great talk. Learned a lot. And see you around. Thank you all.