
How do vector (search) databases work?

March 29, 2025
The Geek Narrator Podcast

Transcript

Kaivalya Abte [0:00]:
Hello everyone and welcome to the show. This is the Geek Narrator podcast. I am your host Kaivalya Abte, and today I have Simon with me, coming back on the show. Last time we talked about how you can think about building a database on top of object storage, which was an interesting idea, and we had a lot of interesting discussions around it. Today we are going to talk about all things vector: vector search and vector databases. We'll try to understand from Simon's experience how to think about them and how to build a mental model for solving problems and building applications using vector search and vector databases. So welcome, Simon. I'm really excited to have you back on the show. For our viewers who are new to the channel, new to listening to you, and new to turbopuffer, let's start with a bit of introduction about yourself, about turbopuffer, and what you have been doing these days.

Simon Eskildsen [0:56]:
Yeah, absolutely. My name is Simon. Thank you so much for having me on the show again; it's great to be back. Last time we talked about building a generic database on top of object storage. I think today we'll get into turbopuffer and building a search engine generically on top of object storage. You asked for a little bit of the origin story of, I guess, the database and myself. I spent just shy of a decade building infrastructure at Shopify. I saw the rise there from hundreds of requests per second up to a million, and as part of that, I worked on almost every single part of the infrastructure, including search. When I left Shopify in 2021, I spent a little bit of time hopping between my friends' companies, consulting on infrastructure, investing a little bit in their companies, and helping them out with whatever was top of mind. The biggest thing I did for a lot of these companies was tune Postgres autovacuum, which is very difficult to tune and a bit foreign coming from operating MySQL for almost 10 years at Shopify. But one of the companies, a company called Readwise, builds an engine to help you read better. They have articles, and highlights from the articles and books that you read on Kindle and things like that, and they wanted to do recommendations, or at least try it out. So we ran a two-week sprint on recommendations, and I was really interested in vectors because back at Shopify I had worked a bit on search, mostly on the infrastructure side, but you end up working on a bit of everything. Surfacing relevant search results is kind of a nebulous activity, right? If someone searches for "red dress" on an e-commerce store, you might need to show them a burgundy skirt because you don't have a red dress, but maybe they're interested in that.
There's a lot of that, and taking strings of data and turning them into things, as my coworker Doug always said, we need to make strings into things, is very difficult when all you have are strings. Of course, vectors solve that problem. They take a red dress and a burgundy skirt and place them adjacent in the coordinate system, and we wanted to do that with articles. With LLMs, you basically chop the head off the LLM and out come embeddings, because there's an internal representation that these are often trained on. You do some fine-tuning specifically for this, but that's a simple way to think about it. What we did was take these articles and build vector representations of them, thinking, okay, articles that are adjacent in vector space, with a small distance between them in the coordinate system, are going to be good recommendations. We actually got pretty good recommendations out of it. They were okay, not phenomenal. This was early in the LLM and embedding days, in 2022, but it was pretty good, and better than I could have done with strings. We ran the back-of-the-envelope math on putting this on one of the vector search engines at the time, and it was a phenomenal cost, around $30,000 a month for a company that was spending maybe $4,000 to $5,000 a month on their Postgres database. It just seemed out of reach, and it was a sad outcome because we had this idea that actually worked, and we wanted it in the product, but cost was holding us back. I distinctly remember Dan, one of the co-founders, saying, "Well, I'm sure the costs are going to drop just as they are for LLMs." Four or five months later, I couldn't stop thinking about this problem, and no one had come out with a vector search solution that was more economical. It was literally holding product back at this company, and I thought, "Well, there must be some latent demand here."
There's Jevons paradox, which often comes up alongside AI and falling prices. It basically says that if you make gas cheaper by, let's say, 20%, then people often drive more than 20% more. You can intuit this: say you live in Berlin and gas gets cheaper. There may be a trade route to some other city, and for some local company selling, whatever, pastries, the fuel costs now work out, so they start offering to also sell in Hamburg. The activity you unlock is nonlinear, and it's the same argument for making energy cheaper. I thought that might happen for search engines the way it did for analytical data in the 2010s. I didn't see all of this at the time, so there's a little bit of retroactive analysis here, but I certainly saw the example firsthand with Readwise. Thinking back on my experience at Shopify working on search, it was very clear to me that the Elastic clusters at Shopify were among the hardest databases to operate, for all the right reasons, and it was the best thing on the market, but it was a very difficult thing to upgrade and scale, and it was holding back a lot of search for a variety of reasons: cost being one, indexing performance being another. I think back to the days when I was debugging those clusters until midnight, trying to get them to perform better, or during some incident. Between my experience at Readwise and my time at Shopify, it had become clear it was time for a new search engine, and vectors seemed to be pressing this even more, because when it comes to text, which is still the most ubiquitous form of data, the vector representation of the text is often larger than the canonical text representation. You can imagine an article that's maybe 1,000 tokens long, right? So you represent that as 1,000 tokens.
Maybe the tokens are five bytes on average, so that's about 5 to 6 kB of data for the raw text, which compresses incredibly well. It's going to be hundreds of bytes to store. Now, for a vector representation of that text, you're going to have multiple vectors, maybe one per paragraph or something along those lines. Let's say that this 1,000-token article is now four vectors, and those vectors might each be 1,536 dimensions, so that's around 6 kB each. Now you have four times 6 kB, right? So 24 kB. You can't really compress these vectors, because every floating point carries so much semantic information, whereas you can compress text, where there's a lot of repeatability. Now you've got a vector representation that's 20 to 30 times larger than the compressed representation of the text itself. That's why we saw at Readwise that the vector cost was so high: all of it had to be stored in memory for a variety of reasons, and it was larger than all the original data and the compressed representation inside of Postgres, where the articles were also stored. That's always been the problem with search: storage has always been expensive, because even inverted indexes have storage amplification, right? How many bytes do you store per byte of input data? For an inverted index, that might be somewhere between 1.5 and 2.5. Sometimes it's even smaller than the original text, depending, but it's generally in that range. Now we're talking about a storage amplification of 10 to 30x, right? If you build a database on top of object storage, as we talked about in the other episode, the fundamental trade-offs are high write latency, in the hundreds of milliseconds, to write directly to object storage, and high tail latency, in the hundreds of milliseconds, when something is cold on object storage. Those seem like phenomenal trade-offs for search, especially if you then make it 100 times cheaper, which S3 is, right?
S3 is around two cents per gigabyte, and memory is around $2 per gigabyte, a hundred times more. That seems like a good trade-off. On subsequent queries, things are loaded into memory, and things are going to be fast. So that is kind of the origin story of turbopuffer.
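Simon's storage-amplification numbers can be checked with quick arithmetic. This is a sketch of his back-of-envelope math; the figures (five bytes per token, four chunks of 1,536 float32 dimensions, a 5x text compression ratio) are taken or assumed from the conversation:

```python
# Back-of-envelope check of the storage-amplification numbers above.
# Assumptions from the conversation: a 1,000-token article at ~5 bytes
# per token, chunked into 4 embeddings of 1,536 float32 dimensions; the
# 5x text compression ratio is an assumed round number.
tokens = 1_000
bytes_per_token = 5
raw_text_bytes = tokens * bytes_per_token                    # 5 kB of raw text

compression_ratio = 5                                        # text compresses well
compressed_text_bytes = raw_text_bytes / compression_ratio   # ~1 kB

chunks = 4
dims = 1_536
bytes_per_float = 4                                          # float32
vector_bytes = chunks * dims * bytes_per_float               # 24,576 B, ~24 kB

amplification = vector_bytes / compressed_text_bytes
print(f"vectors are ~{amplification:.0f}x the compressed text")
```

With these round numbers the vectors come out roughly 25x larger than the compressed text, squarely in the 20-30x range Simon describes.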

Kaivalya Abte [9:15]:
Very interesting. It looks like the economic cost was at the center of your decision-making, right? Because that was the main problem, as you discussed with the storage amplification. Apart from cost, of course, like we discussed in the last episode, building something on top of S3 means you already have a lot of guarantees around durability and the scalability of the system. You don't have to build that into your system. Of course, you have to think about it and handle some of the edge cases, as you said, around write latency and tail latency, but then that also simplifies your architecture. You also used terms like "embeddings," "vectors," "dimensions," and "tokens." I don't want to assume that all our viewers know these basics, so let's quickly get them out of the way so we can talk about the real stuff, like how it works. What exactly is vector search? What exactly do we mean by embeddings? We can probably take an example and explain in really simple terms what they mean.

Simon Eskildsen [10:34]:
Yeah. First off, I use vectors and embeddings somewhat interchangeably. There might be a more precise definition for each, but for all intents and purposes, we can talk about them as the same. What are they? Well, LLMs are very, very good at knowing what data is relevant to other data. That's why they're next-token predictors, right? You have "hello," and it thinks there's a very high probability that the next word is going to be "world." That's fundamentally what an LLM is able to do, and one of the internal representations of that is a bunch of floating point numbers. The simplest way to think about it: imagine each piece of data is a two-dimensional point, and you train the model so that points that are close to each other in space are also semantically close to each other. What I mean by that is you take a banana and you take an apple, and you train the model to say, "Well, they're pretty close." They're not quite the same; a pear is maybe closer to an apple than a banana is, but they're pretty close. They're at least closer than a computer is to any of those words. Now you end up with a coordinate system, right? It's easiest to intuit this in two dimensions. You end up with a coordinate system with, essentially, clusters: there's going to be a fruit cluster, an electronics cluster, an animal cluster, whatever, right? The information that is close in the coordinate system is also semantically close in the real world. So that's really what a vector embedding is. It's a point in a coordinate system where stuff that is related to other stuff is close. You could think about the Euclidean distance between two points, which is just what you did in geometry in sixth grade or whatever, right? LLMs are very, very good at taking information and turning it into a bunch of floating point numbers in this vector space.
The reason we call it vector space and not a point is that it's not in two dimensions; it's in maybe some multiple of 64, 128, 384, up to thousands of dimensions. You can think about each of these dimensions maybe representing a real thing. We don't really know what they do, right? It's latent space. We don't completely know without spending a lot of effort on it, but you could imagine that maybe one of these dimensions represents something about size: an elephant has a very high floating point number as the first one, and an ant has a very low number. Maybe dimension number two represents something about color. Maybe dimension number three represents something about where on the globe this thing often presents itself. You can imagine that you can represent almost every object like that, and that's what LLMs are very, very good at: taking information and turning it into a point in a coordinate system. What we were not so good at was storing this coordinate system, right? Making it cheap to store and cheap to query. It's been cheap to take information and turn it into a point in a coordinate system, but not so cheap to store the coordinate system and ask it questions in high-dimensional space: what's adjacent to this point, what's close to it, what's the average of things close to these points, or whatever questions you could think to ask such a coordinate system. That's what an ANN, or approximate nearest neighbor, index is for: representing that coordinate system in a way that's efficient to query.
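The two-dimensional intuition Simon describes can be sketched in a few lines. The coordinates below are invented for illustration; a real embedding model produces hundreds or thousands of dimensions:

```python
import math

# A toy 2-D "embedding space" for the intuition above: related items sit
# close together. These coordinates are invented for illustration; a real
# embedding model produces hundreds or thousands of dimensions.
embeddings = {
    "apple":    (1.0, 1.2),
    "banana":   (1.3, 0.9),
    "pear":     (1.1, 1.1),
    "laptop":   (8.0, 7.5),
    "keyboard": (8.3, 7.9),
}

def euclidean(a, b):
    """Straight-line distance between two points -- sixth-grade geometry."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(word, k=2):
    """The k items closest to `word` in the coordinate system."""
    q = embeddings[word]
    others = [w for w in embeddings if w != word]
    return sorted(others, key=lambda w: euclidean(embeddings[w], q))[:k]

print(nearest("apple"))   # fruits come back before any electronics
```

With these coordinates, "pear" ranks closest to "apple," then "banana," matching the intuition that a pear is nearer to an apple than a banana is, and that all the fruit is nearer than a computer.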

Kaivalya Abte [14:33]:
Interesting. I think that's a very clear example and clear picture of what all these things mean. Of course, there are so many dimensions to think about when we search for things and not like a direct semantic search. That sounds good. Let's take an example for our entire discussion so we can base our discussion around that example and try to model that and solve that in a way using vector search and vector databases, and how we can do that, let's say, using turbopuffer. The example is again related to the podcast. So far, let's say I have more than 100 episodes, and I want to kind of search across all the episodes that I have. Give me all the clips. This is like the high-level feature that I'm talking about, right? Then we can go a bit lower. Give me all the clips of discussions where we were talking about a certain database or a certain index, like binary search or something like that. Given a specific topic, I want a clip, like timestamps of all the clips that I have, right? Specific to my podcast, for example. If I have to build this, how do I start thinking about this problem space? How do I know that vector search is something suitable here, vector databases are something suitable here? Let's start talking about that.

Simon Eskildsen [15:59]:
Yeah, absolutely. I love rooting all of this in an example, and I'll try to bring it back to the vocabulary we were using. So let's design a search engine that searches through all this data. If I search for "object storage," right, this episode will come up, and the relevant points in either episode where we talked about that are going to come up. The way you would do this is that you first have to make a decision about what's called your chunk. A chunk is the information you feed into the model to get a point in the coordinate system, the vector or the embedding. We have to decide what that's going to be. We could maybe take the entire episode and turn it into a point in the coordinate system. That might work, right? That's basically asking, can we fit the entire episode inside of a context window? But intuitively, it would be very difficult to represent an entire episode about a vector database in a, say, 1,024-dimensional space. That would be a very, very good compression algorithm if we were able to do that. So we might go to a finer granularity. The per-episode approach would be very cheap, right? You have 100 episodes, so 100 vectors. We're not talking about much data: 6 kB times 100, so less than a megabyte. That would be great, and very simple. We could just do that in memory, in JavaScript, in the browser. But the search is not going to be great. If we break it down and trade a little more complexity for a better search experience, we might say that we're going to chunk based on, let's say, every 30 seconds. So every 30 seconds, we just do a hard cutoff and then create an embedding for that. That's probably going to be a decent representation; intuitively, how far can you really travel through the coordinate system in 30 seconds? So then we would take an episode like this, right?
And let's say it's 60 minutes; then you're going to have 120 embeddings, still a very manageable amount of data. So 120 times 100 episodes times 1,024 dimensions, and let's say we're storing it all as four-byte floats: that's about 50 megabytes of data. Again, it's manageable in memory, and we could build a pretty good search engine on that. You could ship that with turbopuffer, but frankly, with that few vectors, you could also do it in memory, because you can search 50 megabytes very fast. You can scan memory at this point at 20 to 30 gigabytes per second on a fairly small machine, so you could do that, and we would get pretty good search results. At this point, we could start to, again, ratchet up complexity for a better experience. Right now, the default way is that you take the query, which might be "turbopuffer," turn that into an embedding, and then run the query against the coordinate system. In this case, we might just exhaustively search through all the vectors of all the episodes and then return the top results. For 50 megabytes, we might be able to do that in tens of milliseconds on a small machine, so it's not too bad. If we had 10x that amount of data, like 500 megabytes, we'd start to run into latency, especially as we get into gigabytes, but it would work. It would just be slower. What you can do then is build an approximate nearest neighbor index, called an ANN index. Without one, we're doing a kNN search: we're comparing the query vector to all the other vectors and finding the top 10 results to show on a search page. An approximate nearest neighbor index is one where we trade off a little bit of precision, or recall, as it's called, by building an approximate, kind of geometric, index of this representation.
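The exhaustive kNN search Simon describes, before any index, is just a linear scan scoring every vector. A minimal sketch, with dimensions and corpus size shrunk from the example's numbers and random vectors standing in for real embeddings:

```python
import math
import random

random.seed(0)

DIMS = 64        # stand-in for the 1,024 dimensions in the example
N = 12_000       # 120 chunks per episode x 100 episodes

# Fake corpus: one vector per 30-second chunk. In a real system these
# would come from an embedding model, not a random number generator.
corpus = [[random.gauss(0.0, 1.0) for _ in range(DIMS)] for _ in range(N)]

def knn(query, k=10):
    """Exhaustive (exact) k-nearest-neighbor search: score every vector
    against the query. O(N * DIMS) per query -- fine at ~50 MB of vectors,
    increasingly painful as the data grows into gigabytes."""
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, v)))
    return sorted(range(N), key=lambda i: dist(corpus[i]))[:k]

top = knn(corpus[42])
print(top[0])   # querying with a known vector returns its own index first
```

Because it compares against everything, this scan is exact by construction, which is why it serves as the ground truth that ANN recall is measured against.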
The simplest way to do it is that you take all the data and you try to cluster it. In this conversation, right, we're talking a bit about databases, a bit about vectors, a bit about Shopify, this and that, and some clusters of conversation might emerge. These 30-second clips can now be assigned to those clusters. To make the search faster, you can imagine that you search for "turbopuffer," and it finds a cluster, maybe the one about databases, and it searches only that cluster. It's not searching the Shopify cluster; it's not searching the others. I mean, we're going to talk about databases the whole time, so there might be some sub-clusters within that cluster, but intuitively, you can imagine there's a banana cluster, an animal cluster, and a computer cluster, and you only search the relevant one. That clustering you do with an algorithm; you use something called k-means, right? It groups content together, and doing it optimally is an NP-hard problem, so you have to do some approximation. Now, this is not always going to work. You're going to search for something that straddles the boundary between some of these clusters. If you're searching in music, there may be artists that are not pop, not rock, but somewhere in between. So there's some approximation in how many clusters you visit. You can imagine that you might have a vector index with a million vectors and, let's say, a thousand of these centroids, the clusters in the content, and you have to choose how many of the clusters closest to your query vector you're going to look at. That's one of your trade-offs in these approximate nearest neighbor indexes. There are also other ways to represent it, but this is probably the simplest one.
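The clustered scheme Simon sketches can be illustrated with a toy inverted-file-style index over 2-D points. Real systems run proper k-means over high-dimensional vectors; here, as a shortcut, random points stand in for the centroids, and the number of probed clusters is the recall/cost knob he mentions:

```python
import math
import random

random.seed(1)

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A toy IVF-style index over 2-D points: the simple clustered ANN scheme
# described above. Real systems run k-means over high-dimensional vectors;
# here we cheat and use random points as centroids to keep it short.
points = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(2_000)]
centroids = random.sample(points, 20)   # stand-in for k-means centroids

# Assign every point to its nearest centroid -- these are the "clusters".
clusters = {i: [] for i in range(len(centroids))}
for p in points:
    nearest_c = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
    clusters[nearest_c].append(p)

def exact_search(query, k=5):
    """Exhaustive kNN: the ground truth to measure recall against."""
    return sorted(points, key=lambda p: dist(query, p))[:k]

def ann_search(query, k=5, nprobe=3):
    """Approximate search: visit only the `nprobe` clusters whose centroids
    are closest to the query. Raising nprobe raises recall (and cost)."""
    order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [p for c in order[:nprobe] for p in clusters[c]]
    return sorted(candidates, key=lambda p: dist(query, p))[:k]

q = (5.0, 5.0)
approx, exact = ann_search(q), exact_search(q)
recall = len(set(approx) & set(exact)) / len(exact)
print(f"recall@5 with nprobe=3: {recall:.0%}")
```

Probing all clusters degenerates back into the exact scan, which is exactly the trade-off: fewer probed clusters means less work per query and lower recall.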
We could do that to speed up the search for your podcast search website. The other thing we can do is try to improve the search results. What we might find is that the 30-second boundary is not great; it doesn't represent the content well, because sometimes we're changing topics a lot. You might try to find a way to create an embedding at some natural conversational boundary. This is a podcast, so you're asking me questions, and then I might give embarrassingly long answers, but you can't embed all of that as one chunk, because I'll change what I'm talking about all the time. So you might use an LLM, maybe a very fast, cheap LLM, to figure out what the boundaries are and then do embeddings on that. Maybe you find another way; maybe you always do it per sentence. There could be all kinds of ways of deciding what the embedding should be to improve the relevance. Another thing you could do is try to incorporate other ranking signals. You could do things like, "Ah, I'm going to weight podcasts that are more recent higher." You might also combine it with a text index, because maybe the embedding model doesn't have any chance to know what "turbopuffer" is, and so it thinks that "turbopuffer" is maybe a fast fish, when really it should be categorized in the database cluster, and you might get bad results because of that. Full-text search is going to be great here because it will match directly on "turbopuffer." Maybe we give that a higher relevance. This is kind of the curse of search, this iterative loop of, "Okay, am I improving search results?" You have some test cases, and you check whether the results are better against what you would expect knowing all the data. And there's that constant push and pull of, "Am I going to put more data in the database, and how do I index it faster?"
That's the part that something like turbopuffer takes care of, and then you take care of knowing your data, your users, how you want your search to be structured, improving all these different signals via turbopuffer's APIs, or another vector database or search engine in general you're using. We don't think of turbopuffer as just a vector database. It's certainly a prominent feature, and we're very good at storing vectors, but we can also do full-text search and other things to help you get good search results.
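The signal-combination idea Simon mentions (vector similarity plus recency plus exact text matches) can be sketched as a weighted score. The weights and the decay formula here are invented for illustration; real systems tune signals like these against labeled test queries:

```python
# A sketch of combining ranking signals: blend vector similarity with a
# recency boost and an exact text-match boost. The weights and the decay
# formula are invented for illustration; real systems tune signals like
# these against labeled test queries.
def combined_score(vector_similarity, days_old, exact_text_match,
                   w_vec=1.0, w_recency=0.2, w_text=0.5):
    recency = 1.0 / (1.0 + days_old / 365.0)   # fades over roughly a year
    text = 1.0 if exact_text_match else 0.0
    return w_vec * vector_similarity + w_recency * recency + w_text * text

# An older episode that literally contains the query term ("turbopuffer")
# can outrank a newer episode that is only semantically close.
old_exact = combined_score(0.70, days_old=400, exact_text_match=True)
new_fuzzy = combined_score(0.80, days_old=10, exact_text_match=False)
print(old_exact > new_fuzzy)   # prints True
```

This captures the point about terms the embedding model has never seen: the exact-match signal rescues queries like "turbopuffer" that semantic similarity alone would misplace.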

Kaivalya Abte [24:21]:
Okay, I think that's a good mental model for how we should start thinking about the problem. For this kind of data, I don't think we really need a database; as you said, we can also do it in memory, right? But then you have to keep tuning the search results, how you plan to do the chunking, and probably you want to create more clusters; then you can think about making it faster and so on. If we start thinking about a pretty high scale for this example, let's say I now want to build it as a service for all podcasters, right? I want to onboard them, and that massively increases the scale. Many other things come into the picture because the data size increases, and the types of podcasts and discussions increase, not only technology but maybe other topics as well. Depending on how you want to provide the search, how you want to chunk the episodes, and how you want to add more clusters, all of that adds up to the complexity and the solution we should be aiming for, right? Maybe for this podcast it's fine to just do it in memory, but if we do it for all podcasts, with many more podcasters on the platform, then it becomes a big problem, something where we also need to focus on cost-effectiveness and things like that.

Simon Eskildsen [26:04]:
One thing, just before we move on, is that you might decide to use a search engine even if you have a small amount of data just because it provides the APIs to do some of these other things like full-text search. You don't have to do all these things in memory. That would be one reason. If you only have a couple hundred vectors, turbopuffer also just exhaustively searches; we don't index it at all. Sorry, go on.

Kaivalya Abte [26:31]:
No, I think that makes sense. The decision has to be based not only on the cost aspect or on what you want to provide; if some database gives you a simple solution, just using that and getting a mix of vector search and full-text search could also help. But as you said, let's increase the scale of the problem, right? There are so many podcasts, many types of topics, different lengths, different guests. There are probably also specific guests that people want to search on. For example, if I have a group discussion, not just one-on-one, with more people in the discussion, people would also want to search for when someone like Simon talked about a particular thing, right? That adds to the types of clusters we want to create and the indexes that would work on them. So let's increase the scale of the problem and the variety of the data, and then start thinking about the problems I might get when latency spikes, the size of the data grows, and the cost increases. How should I think about that?

Simon Eskildsen [27:52]:
Cool. We can walk up the order of magnitudes here. I think that's an interesting exercise. So yeah, let's say at the 50 megabytes, which would probably be what the vector representation here would be. First of all, that's much more than the text of all of this, which you could compress well. If you go up in complexity and you start to search more data, you might have millions or... In the millions of vectors, most databases can do a really good job of vector search, right? If you have a Postgres instance, Postgres comes with a number of different extensions that you can find that are open source to do vector indexing. Honestly, that's great. If you just have a couple million vectors and the features in Postgres are going to get you there from a relevance point of view, that's probably simpler than introducing something else that you have to replicate data into. Where we start to see our customers seeking us out is when you start to get into the tens of millions or even into the hundreds of millions. That's when building these indexes and storing all the data and querying it, especially if you're doing it with a WHERE clause, starts to get tricky. Let's talk about the WHERE clause for a second because you asked about that. What if I want to search for a particular author, or maybe I want to search for a particular date range, or I want to search for a language? Now your vector space is not just what's the closest thing; it's what's the closest thing that also matches this filter. It's not just about what's similar, but what's similar that also has this hard constraint. That starts to get very tricky. We have a blog post on our website where we talk about what we do here. At a high level, you can think about it as let's say that we've tuned this to know that when I search for something, I look at the three closest clusters, and I search for everything that's in those clusters. By default, let's say that those clusters are of size 1,000. 
I'm just making up numbers here, right? That means that every time we search something, we're looking at around 3,000 vectors to try to get a good, accurate result. For kNN, we search everything, so we know that's 100% accurate. ANN indexes are often 95% or more accurate, which means that between, say, the top 10 results you get from an exact search and the top 10 results you get from the approximate index, there's a 95% overlap. That's on average, across everything; some queries are going to be 80%, some 100%. That's what recall means in this context, and it's a very important metric for any vector database. When you have a WHERE clause, maintaining a high recall of 95% starts to get trickier, because imagine now I search those three clusters, and they have 1,000 vectors each, but only 500 of those vectors match the filter. Now I'm only looking at 1,500 vectors for each search query. Is that really enough to get 95% recall? Generally not. You probably still want to look at 3,000, as in the other case. So now I have to look at more clusters, and I have to keep looking at clusters until I've looked at maybe 3,000 vectors. These are the things that make ANN indexing very tricky, because you have to tune this for the dataset, and you have to tune it for how selective the filter is. This one cut 50% off, but if a filter cuts 90% off, maybe you have to look at even more clusters, or maybe you can come up with another clever way of navigating the clusters. Maybe you have another index where you execute the WHERE clause first, and then you just look up the clusters and know how much matches. There are lots of different strategies the query planner has to take into account to provide high recall with filtering. That's the kind of stuff we talk about when we talk about vector databases and why a new category emerged.
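The "keep looking at clusters until you've seen enough matching vectors" idea can be sketched on top of the toy clustered index. The strategy and all the numbers here are illustrative only, not turbopuffer's actual query planner:

```python
import math
import random

random.seed(2)

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Sketch of the filtered-search problem: with a WHERE clause, a fixed
# number of probed clusters yields fewer *matching* vectors, hurting
# recall. One mitigation, shown here, is to keep visiting clusters until
# enough candidates have passed the filter. Illustrative only; not
# turbopuffer's actual query planner.
docs = [{"vec": (random.uniform(0, 10), random.uniform(0, 10)),
         "lang": random.choice(["en", "de"])} for _ in range(2_000)]
centroids = [d["vec"] for d in random.sample(docs, 20)]
clusters = {i: [] for i in range(len(centroids))}
for d in docs:
    c = min(range(len(centroids)), key=lambda i: dist(d["vec"], centroids[i]))
    clusters[c].append(d)

def filtered_search(query, lang, k=10, target_candidates=300):
    """Visit clusters nearest-first, filter as we go, and only stop once
    `target_candidates` vectors have passed the filter."""
    order = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = []
    for c in order:
        candidates += [d for d in clusters[c] if d["lang"] == lang]
        if len(candidates) >= target_candidates:
            break
    return sorted(candidates, key=lambda d: dist(query, d["vec"]))[:k]

results = filtered_search((5.0, 5.0), "en")
print(len(results), all(d["lang"] == "en" for d in results))
```

The more selective the filter, the more clusters this loop has to visit to hit its candidate budget, which is exactly why rare filters make ANN tuning hard.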
I think that category will turn from vector databases into search engines. But this is the kind of thing that sometimes Postgres will have trouble with. Postgres will make some naive decision and not have a query planner that's completely integrated with the high-recall use case. We see some cases of this. You can get away with it by just looking at more data, or pre-filtering, and things like that. But in general, this starts to be challenging. The other thing that gets challenging when you get into the tens or hundreds of millions of vectors is that the vectors are large and do not compress well. If you store them in Postgres, a relational database, you're going to store the data on the writer, the primary, and then often on two read replicas off the writer. This means you're storing the data three times. The other thing is that no one in their right mind runs a database at 90% disk utilization, right? That's just too scary. You might run it at, let's say, 50%; you might be comfortable with that. So now we have to buy actual disks to store the data on, and those disks, say an EBS volume or a PD in GCP, cost you somewhere between 10 and 20 cents per gigabyte, depending on what's underlying the disk. Local NVMe is a little cheaper, maybe 8 cents, but it has its own challenges, because as soon as you restart the node, the data is gone. You really want to know what you're doing before going there. So with network disks, say 10 cents per gigabyte; let's round it to that. But because of the 50% disk utilization, you're actually paying 20 cents, and you have three copies. So for every gigabyte of data you store, you pay for it three times, multiplied by two because of the utilization. You're paying 60 cents per gigabyte of data stored. On object storage, you pay two cents per gigabyte.
Of course, with vectors too, you're not just storing the vector once. You often store the vector multiple times; you have to do all these things where sometimes you store it not just in the closest cluster but in multiple clusters, and so on. Often these indexes get even larger than just storing the vector once, which is why something like Postgres and the economics of it start to not make sense at some point. Essentially, if you weigh the cost and operational complexity of keeping search in the main database against the cost and operational complexity of running a different database, at some point those curves cross, and you may decide that you want to break it out. Since the dawn of time, something like Postgres has had full-text search features, and MySQL has had Sphinx and others. But often at this scale, search becomes one of the things that gets uneconomical and difficult to manage inside the main database. Keep it in there for as long as you can, because it's simple to manage, but there comes a point where babysitting that becomes more difficult than operating a separate search engine outside of the Postgres universe. You might also be ready for some features to enhance your search that are not in there. So it becomes a question of that, and also of indexing, right? Building an approximate nearest neighbor index is an expensive activity. If you have a 32-core or 64-core Postgres cluster and you're building an index over 100 million vectors, that could consume all of your CPUs while you're doing it. I love the DBAs at Shopify, but I don't think any of them would let me build ANN indexes on all of those hundreds of thousands of shards at that scale. That's the trade-off, right? But if people can get away with the features inside of something like Postgres, I would really, really encourage it. Those are expanding over time.
But you can't really outrun the economics, and they catch up to you quick because the vectors are so large.

Kaivalya Abte [35:31]:
Yeah, that's interesting. I think it also adds up because, let's say you use Postgres and you release a feature that, oh, now you can do this, but then you always want more features, right? Your users want more features, and they want to search more. They want to also filter more, and then it's always going to grow. So even if it works now, maybe it won't work, let's say, in six months or one year down the line, depending on the data size, of course. You mentioned a few interesting things there, right? One, like building these indexes is a really expensive operation. So I want to understand how that is done. But before we do that, let's say now I decide to use a vector database, something like turbopuffer, for example. When I use LLMs, I create all the embeddings, and I give it to turbopuffer. What exactly happens then? When is the index created? The other question I wanted to ask was about recall, but probably after this question.

Simon Eskildsen [36:45]:
Yeah, I mean, simplified, when you write a vector to turbopuffer, we put a file into object storage in our write-ahead log, right? Every new entry makes it in there.

Simon Eskildsen [36:56]:
Then we return an OK once we've committed to object storage. That's our fsync. A turbopuffer database is what we call a namespace; you can think about it as just a prefix on S3. When you have written, say, 10 vectors and you query it, we're just going to load them all into memory, right, or stream them in from S3, then into the disk cache and into memory. If you write 100 vectors, we'll still do an exhaustive search. At 1,000, we'll probably still do an exhaustive search, but once you get into many thousands or tens of thousands, then we'll start building the index. The indexing nodes are going to get a notification to take that namespace, that prefix on S3. Let's say it's "podcast/"; they will go get "podcast/" and start building the ANN index. We're not using a graph-based algorithm in turbopuffer for this; it's always going to be something centroid- or cluster-based, like I described before. There are a lot more tricks to it, but that's the indexing we'll do. When you then run a query on turbopuffer, it will look inside of that prefix for an index file. If there's an index file, it will go look up the vector in the index; if not, it will just search exhaustively, and either way it applies the newest WAL on top of the index for that namespace. Yeah, that's how the writes work.
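The write and read paths described above can be modeled with a toy class. This is my own simplified sketch of the behavior Simon describes, not turbopuffer's implementation; the threshold constant and class shape are assumptions for illustration.

```python
import math

INDEX_THRESHOLD = 1000  # hypothetical cutoff; the real trigger is tuned

class Namespace:
    """Toy model of the write path described above: every write lands in a
    WAL, an index is built once enough un-indexed entries accumulate, and
    queries overlay the newest WAL entries on top of the index."""
    def __init__(self):
        self.wal = []          # committed (id, vector) entries, in order
        self.index = None      # stand-in for the ANN index snapshot
        self.indexed_upto = 0  # how much of the WAL the index covers

    def write(self, vec_id, vec):
        self.wal.append((vec_id, vec))  # the "fsync" to object storage
        if len(self.wal) - self.indexed_upto >= INDEX_THRESHOLD:
            self.index = list(self.wal)       # pretend to build the index
            self.indexed_upto = len(self.wal)

    def query(self, q, k=3):
        base = self.index if self.index is not None else []
        tail = self.wal[self.indexed_upto:]   # newest WAL applied on top
        scored = sorted((math.dist(v, q), i) for i, v in base + tail)
        return [i for _, i in scored[:k]]
```

Below the threshold, every query is an exhaustive scan of the WAL; above it, the "index" answers most of the query and only the un-indexed tail is scanned.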

Kaivalya Abte [38:30]:
Interesting. So it automatically decides that, oh, this is the right time I should build indexes because the data is growing, and the user doesn't have to do anything about, you know, "Oh, create index," something like that. So that's not there. It automatically decides. On the query side, how does it know that for this particular query using an index would be better? Because that may also depend on the query itself, right?

Simon Eskildsen [38:57]:
Yeah, so that's the query planner's job, right? By default, with ANN searches, it will always use the index if it's there.

Kaivalya Abte [39:04]:
Mm-hmm.

Simon Eskildsen [39:05]:
Exactly. It uses the index, and I mean, there's multiple indexes, right? If you're filtering, then we have to build non-ANN indexes for that and kind of interleave them. The query planner has to look at the statistics of the data distribution and make decisions about how it's going to do that to provide high recall for the query. But if there's an index, we'll always use the index.

Kaivalya Abte [39:24]:
Okay, interesting. What are the other types of indexes that are typically useful? How does it decide based on the nature of the data or the queries it has seen, or how does it decide what kind of index would be suitable? Or maybe multiple indexes, and the query planner can decide to use this index and use that.

Simon Eskildsen [39:48]:
There are multiple types of ANN indexes, the approximate nearest neighbor indexes that are in use in production today. In broad strokes, there are two that are most used. There are the centroid-based ones; that's what turbopuffer uses, because it works best for object storage, and it works really well for lots of writes, for reasons we can get into if you want. Then there are the graph-based indexes. The graph-based indexes are phenomenally fast at queries, right? Where a centroid-based index might take a millisecond, a graph-based index might take 100 microseconds. But everything generally has to be in memory for a graph-based index, so that's the downside. If you're doing 100,000 queries per second and you want to do it on a single node, it's very hard to beat a graph index. But very few people are doing that.

Kaivalya Abte [40:36]:
Mm-hmm.

Simon Eskildsen [40:36]:
That's why we think the centroid-based approach is very, very good for a lot of workloads. The graph-based indexes, if we just get into it a little bit, you can imagine a graph-based index as you try to build a graph over time where vectors that are close in vector space are also connected in the graph.

Kaivalya Abte [40:55]:
Mm-hmm.

Simon Eskildsen [40:55]:
Then essentially, you can imagine that you start in the middle of the graph, and then you look at, okay, this graph might have, say, 64 other nodes or vectors that it's connected to. I'm going to go and look at the one that I'm closest to and perform a greedy search. Then I have a priority queue of what am I now closest to. I keep navigating the graph for some hyperparameter-tuned number of steps and size of the priority queue, and then at some point, I spit out the result. This is really fast in memory, but the trade-off is that when you navigate a graph, you have to do a lot of round trips, right? It's like I go here, then I have to go get this data, then I go to the next point in the graph, then the next point in the graph, and every time I do that, I have to do a round trip. A round trip to memory is fast, right? It's in low hundreds of nanoseconds. If it's in the CPU cache, it's in single-digit nanoseconds. This is very fast to do 10 or 20 of them. But if you're going to a disk, a random read to an SSD could easily be in single-digit milliseconds, even for an NVMe SSD. For object storage, of course, we're talking hundreds of milliseconds for every round trip. The way that something like turbopuffer is designed, it just doesn't fundamentally work well with a graph. You can do things with a graph to shrink what's called the diameter of the graph to try to minimize the number of jumps you have to do, but you start to lose a lot of the benefits of it being a graph to begin with.
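The greedy walk with a priority queue that Simon describes can be sketched as a small best-first search. This is an HNSW-style illustration under my own assumptions, not any particular library's implementation; `ef` and `max_steps` play the role of the hyperparameters he mentions.

```python
import heapq
import math

def greedy_graph_search(graph, vectors, entry, query, ef=8, max_steps=64):
    """Best-first search over a proximity graph (an HNSW-style sketch).
    `graph` maps a node id to its neighbor ids; `ef` is the beam width."""
    dist = lambda n: math.dist(vectors[n], query)
    visited = {entry}
    frontier = [(dist(entry), entry)]   # min-heap: next node to expand
    best = [(-dist(entry), entry)]      # max-heap of the ef closest so far
    steps = 0
    while frontier and steps < max_steps:
        d, node = heapq.heappop(frontier)
        if len(best) >= ef and d > -best[0][0]:
            break                       # nothing reachable is closer
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (dist(nb), nb))
                heapq.heappush(best, (-dist(nb), nb))
                if len(best) > ef:
                    heapq.heappop(best)   # evict the farthest candidate
        steps += 1
    return [n for _, n in sorted((-d, n) for d, n in best)]
```

Every expansion of `node` is one of the round trips discussed above: essentially free in memory, but a full network round trip if the neighbors live on disk or in object storage.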

Kaivalya Abte [42:17]:
Mm-hmm.

Simon Eskildsen [42:18]:
The centroid-based approach, right? We're back to the coordinate system. We're back to finding information that's semantically grouped and clustered, right? We have the rock cluster, the pop cluster, and the electronics cluster, right? Data that naturally groups together will also naturally work when we then search for, I don't know, electronic dance music, and then it goes to the electronic cluster and searches just that. Centroid-based indexes make use of that, and they work really well for object storage because we just have to do fundamentally two round trips, right? We have to store all the centroids somewhere. The centroids are the average of all the vectors in the cluster, and we just search all of those. We find the top k closest clusters, and then we go download all those clusters. We're downloading a lot more data than we do in the graph, but that's what disks and S3 are very, very good at, is downloading a lot of data. This is also called IO depth, right? Basically, this fan-out needs to be very, very high. But if you have everything in memory, fundamentally, it could be a bit slower than doing a bunch of round trips, right? Which is the fundamental trade-off between these two approaches. But it works very, very well for object storage and for disks to do the centroid-based approach.
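The two logical round trips of a centroid-based search look roughly like this. It's a minimal sketch of the general technique (sometimes called IVF-style search), with function names of my own choosing; real systems add quantization and many other tricks on top.

```python
import math

def build_centroids(clusters):
    """A centroid is just the per-dimension mean of its cluster's vectors."""
    return [tuple(sum(dim) / len(c) for dim in zip(*c)) for c in clusters]

def centroid_search(query, centroids, clusters, nprobe=2, k=5):
    # Round trip 1: fetch and scan all the centroids, keep the nprobe
    # closest clusters.
    nearest = sorted(range(len(centroids)),
                     key=lambda i: math.dist(centroids[i], query))[:nprobe]
    # Round trip 2: download those whole clusters (concurrently, in a real
    # system) and scan them exhaustively.
    candidates = [v for i in nearest for v in clusters[i]]
    return sorted(candidates, key=lambda v: math.dist(v, query))[:k]
```

Compared to the graph walk, this downloads far more data per query, but does it in a fixed, small number of round trips, which is what object storage and modern SSDs are good at.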

Kaivalya Abte [43:28]:
Interesting. Do you also do this in parallel? For example, I imagine that when you say, "Okay, search this cluster and the other cluster," these are again S3 files, right? If you have to search a given search space, you already know it, so you could also make parallel calls to S3 and get the search results back faster, right?

Simon Eskildsen [43:57]:
Of course.

Kaivalya Abte [43:57]:
Yeah.

Simon Eskildsen [43:57]:
Yeah, we do exactly that. I mean, we go way beyond that, right? When we search for something, we do all these S3 requests. But even the S3 requests, often you want to—if a block is larger than 4 megabytes, you want to do multiple block fetches and things like that. There are lots of optimizations you can do to try to saturate the network. When the bits come back on the wire from S3, we just transpose it, project it directly into memory. It's completely zero copy, right? We can get it as fast as possible with as few memory copies as possible. The only copy should be the copy from the network buffer right into user land, which would be io_uring and so on. You could even get around that. That massive concurrency is really good for S3, and it happens to be also very good for modern disks. A lot of software today, including all the Unix utilities like tar and so on, do not take advantage of the fact that you can do so many concurrent reads to a disk—a ridiculous amount, right? Millions per second, and very little software today takes advantage of that. You have to remember that these kinds of SSDs only became generally available in AWS in mid-2018. Not a lot of databases fully take advantage of something that was only GA in the clouds that late.
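Splitting a large fetch into concurrent range requests, as described above, can be sketched like this. The 4 MB figure comes from the discussion; `fetch_range` is a stand-in I'm assuming for an S3 ranged GET, and real implementations go much further (zero-copy buffers, io_uring, and so on).

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4 * 1024 * 1024  # split anything larger than ~4 MB, per the above

def parallel_get(fetch_range, total_size, chunk=CHUNK, workers=16):
    """Fetch one large object as many concurrent range requests.
    `fetch_range(start, end)` stands in for an S3 ranged GET (assumption);
    issuing the ranges concurrently is what saturates the network."""
    ranges = [(off, min(off + chunk, total_size))
              for off in range(0, total_size, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: fetch_range(*r), ranges)
    return b"".join(parts)
```

The same fan-out pattern applies to local NVMe reads, which is the high IO depth point made above.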

Kaivalya Abte [45:14]:
Interesting. That's very interesting. The other thing that you mentioned, and I want to understand how that would work is, you know, so basically you may end up searching everything or you may end up searching parts of the cluster depending on if you already have the index. You used a term called recall, right? You also want to meet a certain level, let's say 95%. How do you compare? How do you measure that the recall is not going down, and where is that happening?

Simon Eskildsen [45:50]:
There are two ways of doing it. There is the academic way, or the CI way, where you generate a bunch of datasets and a bunch of queries. For every query, you do the kNN search, the exact search, right? The 100%-accuracy search that's slow, and you generate the correct result for, let's say, the top 10. Then you run your approximate queries, compare the results, and see: okay, on this query it's a 90% match; on this one it's 80%; this one is 100%. You average the results; you've got the recall. Every vendor does that, right? Everyone runs that as part of their CI, probably, or when they make changes to how their ANN indexing works; they run it on a bunch of different benchmarks. So that's table stakes. If you can go beyond that and generate your own datasets, you can do it with filtering, which is the difficult thing. It's not that hard to do ANN search, but it's harder to do ANN search with WHERE conditions and filters. Then you need datasets around that, make sure that recall is high for those, and you can measure against an exhaustive search. So that's one way we do it. Of course, we are also very humble to the challenge of production, where we don't control the data distributions. Academic datasets tend to have nice data distributions and don't always have the messiness of real-world data, where you might have a million duplicates and things like that, and you still need high recall. So what we do is that in production, for a percentage of queries, we measure recall in production. We have a fleet of nodes, and all they do is run exhaustive searches against ANN queries to see what the actual recall is. Inside of our Datadog, we then have that grouped by all the different organizations, and we can see what their actual in-production recall is, so that when we tune the indexes and ship changes, we make sure this is meeting a high quality bar.
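The recall computation described above, comparing an ANN result against the exact kNN ground truth and averaging over queries, is simple to state in code:

```python
def recall_at_k(exact_ids, approx_ids, k=10):
    """Fraction of the exact top-k that the approximate search also found."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

# Per-query overlap against the exact (kNN) ground truth, then averaged.
results = [
    (["a", "b", "c", "d"], ["a", "b", "c", "x"]),  # 3 of 4 match
    (["a", "b", "c", "d"], ["a", "b", "c", "d"]),  # perfect match
]
scores = [recall_at_k(exact, approx, k=4) for exact, approx in results]
print(sum(scores) / len(scores))  # 0.875
```

This is the same calculation whether it runs in CI against benchmark datasets or, as described above, on a sampled fraction of live production queries.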
Recall is very tricky to get right on every single query, which is why we do this empirically. For example, this is something we come up against a lot: we run a POC with a customer, we share the recall numbers, and maybe the recall is, let's say, 92%, and they want it to be higher. We look at the queries, and it might be that they have a lot of queries that don't have a good result. With full-text search, if you search for "turbopuffer" in a dataset that doesn't mention "turbopuffer," you will just get zero results, and you will be happy with that, because there is nothing about this thing. Even if there is something about vector databases or something about fast fish, you're just not going to get it, and that's fine; we've reached the limits of full-text search. But for vectors, there is no such thing as a zero result. A banana is similar to a computer in some way. An apple is maybe more related to a computer than a banana is, but they're all related in some way. When we analyze recall on some of our customers' datasets, often the low numbers come from queries that would traditionally be zero results. But because everything is related to everything else in vector space, there's no such thing as a zero result. You can intuit that if you search for something in a coordinate system and nothing is close to it, and all the clusters are really far away, it ends up being a bit randomized which results you get, especially in a centroid-based approach. A graph approach often fares better on these results that don't matter, but a centroid approach, not so much. So we see some limitations of recall as a general measure in that way. It's a really useful signal, but there are cases where it doesn't fare that well, because it doesn't encapsulate this kind of thing.
In real life, people will search for things that are not even close to what's in the dataset. We would rarely see this in academic datasets, because a lot of them, though not all, have a good query distribution against the actual data.

Kaivalya Abte [49:58]:
Interesting. Yeah, that's a very good point. Of course, it's very challenging to observe this thing, and you probably cannot do it on each query. Talking about going a bit further into the database: like you said, you have S3 files, and if I remember correctly, last time we discussed the particular format that you use, and I think you use the Rust library rkyv for the storage format. Is it the same format that you have for the indexes and the data, or is it something different?

Simon Eskildsen [50:37]:
Yeah, I mean, we do intentionally have some differences in how we store the index and the data. Part of that is just safety, right? If, God forbid, we caused a problem in how the data is indexed, well, then the data lives in a different code path where it's written into the WAL. So it's a safety thing that they differ a bit, but they both use rkyv. rkyv is a Rust library for a zero-copy serialization format.

Kaivalya Abte [51:18]:
Interesting. Cool. Talking about, so this is, so we discussed how the data is getting ingested, how files are getting created in S3, and then how indexes and when indexes are going to be created. What does it give me back? Is it again like a set of vectors? Because whatever I am ingesting in the database is basically only embeddings, right? How do I convert back the embeddings into something meaningful?

Simon Eskildsen [51:53]:
Yeah, so generally what you do is that if you have a chunk of text, again, a snippet of a podcast episode in raw text, you send that to OpenAI or Voyage or Jina or one of these companies that specialize in creating embeddings and training models for it, and you get back the vector, right? A bunch of floating-point numbers. Typically, what we see our customers do is send the raw text, or the raw image, or whatever the raw entity is, alongside the vector. Because we store it in object storage, you can then say, "Well, I don't need filtering on this; I just need it to be stored alongside," and we give you a heavy discount, because it's stored inside turbopuffer on object storage like that. Generally, you don't try to reverse-engineer the vector; you just store the canonical data alongside it. There's inevitably some loss of precision in the embedding; it would be a little bit like taking a JPEG, downscaling it, and then expecting to be able to upscale it again. You are compressing it in some way, even though the vector is often larger than the original text. For images, it's often smaller, but for something like text, it's generally bigger; either way, it's lossy compression.

Kaivalya Abte [53:03]:
Okay, so what I get back is the original data, right? Because you're storing it alongside in the vector database.

Simon Eskildsen [53:12]:
Yeah, alongside the vector, yeah.

Kaivalya Abte [53:14]:
Okay, so that's really powerful, also in the case of turbopuffer, because the storage is all in object storage anyway. Even if I put the entire data there, the size of the data is not going to blow up the cost, because I have the original data and I also have the embeddings and stuff like that. And the data doesn't grow with the type of queries that you're doing, like the number of vectors you're going to create and so on. So I think the raw data is still smaller compared to the vectors and the indexes that you're storing, right?

Simon Eskildsen [53:50]:
Generally, yes.

Kaivalya Abte [53:51]:
Right. Okay.

Simon Eskildsen [53:52]:
There's cases for everything. You could do embeddings of massive pieces of text, but in general, yes, exactly.

Kaivalya Abte [53:57]:
Interesting. That's cool. What are some of the interesting use cases that you have seen, you know, people using turbopuffer and using it for a combination of vector search and full-text search? What are some of the challenges that you have seen?

Simon Eskildsen [54:19]:
Yeah, so we can talk through some of the case studies of some of our customers, right? We've got Cursor as one of our customers, and of course, if you use Cursor, you'll have seen that sometimes it draws context in from the codebase. You might search for the code that does the clustering, and then it knows that, oh, it's the k-means code, it's in this file that we don't have open right now. That's what Cursor can use to have the agents and the edits and things like that draw context from the entire codebase. So that's one way that they use it. They trained their own embedding model, where they take the code, chunk it up, and then encode it with their own embedding model. Aman, one of the co-founders, shared some really impressive stats about this embedding model that they're training. That's one use case. Another use case is Notion. Inside of Notion, you have the ability to use AI to start documents and ask it questions about your data: "What's my company's policy?" or "Can you draft a document about this for our weekly meeting?" For that, it needs to draw in context from your Notion workspace, so they create embeddings of your entire Notion workspace, store them in turbopuffer, and then pull the relevant elements of that into the model's context. So those would be some of the case studies of customers that people might know.

Kaivalya Abte [55:51]:
Yeah, absolutely. I think everyone might have used Cursor by now because it's so popular, and of course, Notion as well. While I was thinking about Notion and also some of the use cases, I don't know, for Canva, there's a lot of duplication, right? People are creating the same type of images and the same type of videos and stuff like that. Do databases like turbopuffer also take care of that to optimize that, like reducing the redundancy or the duplication, at least for the same customers, right? Maybe for different customers, you may not be able to do that, but for the same customers, is it also done?

Simon Eskildsen [56:38]:
I mean, there's two answers to that question. I think that if you know that you have a lot of duplicates, then you can work on it on your data, right? You can represent a document that says, "Okay, this is the image, and here are all the places that it occurs in the dataset," and do your own deduplication. That would be the customer doing it. If you have an image in, let's say, 100 Notion docs, we can't really know that. I mean, all those vectors are going to have different metadata, right? One is going to say, "Oh, I'm inside of this document," and "I'm inside of this document." So we can't really do much there. You might be able to do something in the ANN index where we compress it harder because we know the vectors are the same and they have these attributes. Those are the kinds of optimizations that a good ANN index will do when it encounters a lot of duplicate data. But it can be hard to generically do it across all attributes, right? But generally, turbopuffer is at a price point where we don't really want people to have to worry about these things, right? We want people to just pump in a lot of data, all the data they can possibly get their hands on, and make it available to the LLM that they want to query, whether it's through an agent Q&A or just general search for the end user. We don't want people to have to worry too much about that kind of thing.

Kaivalya Abte [57:49]:
That's fair enough, right? Because ultimately, the cost of data is what you are optimized on, so people may not have to worry about that. How do you make sure that, you know, since these creation of indexes and, I don't know, even search, it's a pretty expensive operation, right? How do you make sure that the different tenants or customers have enough isolation so that they don't have problems, especially when you have things on object storage?

Simon Eskildsen [58:29]:
Yeah, so object storage is by nature very good at multi-tenancy, right? There are certain things you do on top of object storage to make it even better at multi-tenancy. One of those things, for example, comes from the fact that at the end of the day, pretty much every database does some kind of sharding; it's just a question of whether it's automatic or not. S3 and Google Cloud Storage and Azure all do the same thing, where if you have three keys, A, B, and C, and C is really, really hot while A and B are moderately hot, the store might decide to put A and B on some nodes and C on some other nodes. They continuously split the key space, so you can do various things to randomize your key space as much as possible so that this kind of activity works better and load spreads uniformly over the shards. In general, S3 is very good at this kind of stuff; GCS is very good at this kind of stuff. We just have to help it a little bit on our side. Of course, when you query turbopuffer, you will land on a node that we think has cache affinity to that data. Those nodes could get hot, so there are various things we can do there to split traffic across multiple nodes. We can make sure that tenants aren't impeding on each other by limiting the amount of CPU for different tenants. There are lots of things like that you have to do when you run a multi-tenant system. But the flexibility of this is great, because suddenly you're ingesting a billion vectors over the course of a few hours; we can scale that up, and everyone shares the compute when there are large spikes, so it works really well. This is one of the things my co-founder Justine and I spent an enormous amount of time on at Shopify, which is probably one of the largest multi-tenant systems in the world, right?
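One common way to "help" the object store spread load, hinted at above, is to randomize keys with a short hash prefix. This is a generic pattern for object-storage key design, not turbopuffer's actual key scheme; the function and prefix length are assumptions for illustration.

```python
import hashlib

def randomized_key(namespace, object_name):
    """Prefix each key with a short hash of its path so that hot namespaces
    spread across the object store's internal partitions. This is a generic
    pattern, not turbopuffer's actual key scheme (assumption)."""
    digest = hashlib.md5(f"{namespace}/{object_name}".encode()).hexdigest()
    return f"{digest[:4]}/{namespace}/{object_name}"
```

Because the prefix is derived deterministically, lookups don't need a directory: the same logical object always maps to the same randomized key.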
It's making sure that capacity is fairly shared; making sure that if one tenant drives millions of requests per second, which is often bots, other tenants do not get booted out or see some kind of degradation of service because of them; prioritizing traffic for a tenant, so, for example, traffic that has a cart is more valuable than traffic that doesn't, or whatever it might be. We can do these kinds of things over time too, to make multi-tenancy really good for turbopuffer. At the end of the day, the main thing you can do for multi-tenancy is just scale up as much as possible, and turbopuffer can scale up very, very quickly. But the clouds will only give you a machine in around 120 seconds or so before it's booted up, even though turbopuffer itself boots very quickly. These are things that we'll get better at over time, but the system performs very, very well.

Kaivalya Abte [1:01:01]:
Very interesting. Talking about observability, right? Any database needs to have some metrics around how it is performing, query latencies, and index usage and stuff like that. What are some of the interesting metrics that are probably specific to vector searches and full-text searches?

Simon Eskildsen [1:01:22]:
Sorry, for monitoring?

Kaivalya Abte [1:01:23]:
For understanding the performance and if there is any problem going on and if you need to scale up and stuff like that.

Simon Eskildsen [1:01:32]:
Sure. I mean, the biggest thing we look at is latency, right? So we look at query latency. You look at indexing throughput: how many tens or hundreds of megabytes per second can you index? You look at how long the cache is living: how many days until something TTLs out of the cache? We look at raw throughput. You look at CPU and memory and things like that, of course; very standard things. Maybe the most domain-specific are the indexing throughput and how we're keeping up with how fast people are ingesting. So writing into the system: how fast are we indexing what people are ingesting? We can often index up to tens of thousands of vectors per second, so as long as you're ingesting at a lower rate than that, we're good. If you're above it, we allow spikes and bursts, maybe we'll throttle you, and worst case, we can't keep up. These are the kinds of things we look at on the write side. On the query side, we have lots of tracing in place to look at query latencies and things like that. There are cold queries, and there are warm queries. So we optimize the cold queries coming straight out of object storage; we optimize the warm latency when the WALs make it to NVMe SSDs, which takes quite a bit of work to make work really well; and then you optimize the hot queries in memory, which are often quite fast.

Kaivalya Abte [1:02:53]:
Very cool. Where is this? I understand that more and more databases, because of cost reasons and because of all the simplicity that it brings in, people are moving to object storage at least in some of the use cases. Where do you see this, you know, the trend of vector search and vector databases going? Where do you see turbopuffer will be used more and more?

Simon Eskildsen [1:03:25]:
Yeah, I mean, I think what we see our customers doing is that they want to vectorize all the data they have, and they often have a lot more data than they're currently vectorizing. That's the number one. The second is that we see them wanting to do full-text search, right? If you're searching in a dataset, then the embedding model might not know all the terms, right? I'm sure you might use terms with your friends or family or your partner that an embedding model has no chance of knowing what they mean.

Kaivalya Abte [1:03:54]:
Right.

Simon Eskildsen [1:03:54]:
Right? That's where full-text search is really useful. If you're searching for an opaque SKU, which is always some weird combination of characters, full-text search performs really well. We see vector databases, at least ours, moving from being primarily focused on vectors to being focused on search in general: allowing you to rank a file you just edited, or a document you just edited, or a song you just listened to, as more relevant, even if you're just doing vector search. We have to help you with that kind of stuff. Typically, when you search, you take millions or tens of millions or billions of things and comb them down to a couple hundred. That's what turbopuffer is good at. Once you have a couple of tens or hundreds of things, then typically you'll have a model where you give it the entire document or song, in your case it would be just that snippet of text, plus the query, and you run the full LLM over it. You can imagine passing that to ChatGPT and asking, "What is the most relevant of these ten?" and then you get a result back. We see our role in that as the first-stage retrieval: to find not the needle in the haystack, but the needles in the haystack that are relevant to you. The second stage could be an LLM; it could be a trained re-ranker or something along those lines. The trend here is that in the past, people were used to writing a giant JSON document to describe exactly how everything is ranked. Now, instead of writing big JSON or SQL queries, they're writing a simple JSON or SQL query to go and locate the tens or hundreds of needles in the haystack, and then, in their Python script, using some kind of re-ranker, sometimes called a cross-encoder, over the documents they get back. This is very, very competitive with the very large JSON queries that people have historically passed into these often Lucene-based engines.
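The two-stage pattern described above, cheap first-stage retrieval followed by an expensive re-ranker, fits in a few lines. All function names here are illustrative; `first_stage` stands in for a vector/full-text query and `rerank` for a cross-encoder or LLM scoring call.

```python
def two_stage_search(query, first_stage, rerank, k_retrieve=200, k_final=10):
    """First-stage retrieval narrows millions of documents to a few hundred
    candidates; a more expensive model (a cross-encoder or an LLM call)
    then re-scores only those. All names here are illustrative."""
    candidates = first_stage(query, k_retrieve)   # e.g. vector + full-text
    scored = [(rerank(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k_final]]
```

The economics work because the expensive model only ever sees `k_retrieve` candidates, never the full corpus, which is the division of labor Simon describes.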

Kaivalya Abte [1:05:55]:
Very interesting. Cool. I think in this one hour we can only cover so much, but we have covered the basics and also some interesting things about how turbopuffer provides these features, how it stays cost-effective, and what challenges are involved. The space is large, though, right? What would you recommend people read to better understand vectors and how to build applications around search as a unified experience, not only vector but also full-text and exact matching? Are there any resources you would recommend?

Simon Eskildsen [1:06:41]:
Yeah, I mean, there are a lot of papers on all these indexing algorithms. But honestly, ChatGPT is pretty phenomenal for this kind of stuff. I would ask it: how does this work? What's a graph-based index? What's a cluster-based index? Can you give me a super simple Python or JavaScript implementation? Then play around with it. That's probably the most effective way, if you have the patience to sit down with it. Our blog has some content (we would like more), and our docs cover how these things work: how turbopuffer works, its architecture, observability. I think that's a good guide as well. For full-text search, the Lucene source code is where I've learned almost everything I know about this. The Lucene docs are also quite good; I would recommend reading the docs on how the data is stored. That's probably the most useful. If there's something you don't understand, get ChatGPT to reason through it with you. Running Cursor on the Lucene codebase is probably going to be hard to beat for learning how full-text search works. There are books and things like that, but actually going to the codebase with an inquisitive mind and some good questions, reading the source, and quizzing it is going to be the most effective way to learn as much as possible about this. The other thing I'll say is to actually write the search application. That would be the first place I would start, because you're going to ask better questions as a result. Build the thing you're building over your own dataset; that is going to give you the best questions to ask once you want to learn more. So I would always start with the actual use case.
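The "super simple implementation" Simon suggests asking for might look like this minimal cluster-based (IVF-style) index. Vectors are assigned to their nearest centroid at insert time, and a query scans only the members of the nearest cluster(s) instead of every vector. The centroids here are hand-picked for illustration; a real index would learn them with k-means, and real systems probe several clusters to trade speed for recall.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class ClusterIndex:
    def __init__(self, centroids):
        self.centroids = centroids
        # one posting list of (doc_id, vector) per centroid
        self.clusters = [[] for _ in centroids]

    def add(self, doc_id, vec):
        # assign the vector to the list of its nearest centroid
        nearest = min(range(len(self.centroids)),
                      key=lambda i: dist(vec, self.centroids[i]))
        self.clusters[nearest].append((doc_id, vec))

    def search(self, query, k=1, nprobe=1):
        # probe only the nprobe nearest clusters, then rank their members
        order = sorted(range(len(self.centroids)),
                       key=lambda i: dist(query, self.centroids[i]))
        candidates = [item for i in order[:nprobe]
                      for item in self.clusters[i]]
        candidates.sort(key=lambda item: dist(query, item[1]))
        return [doc_id for doc_id, _ in candidates[:k]]

index = ClusterIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
index.add("a", [0.5, 0.2])
index.add("b", [9.5, 9.9])
index.add("c", [0.1, 0.4])
# only the cluster near the origin is scanned for this query
print(index.search([0.0, 0.5], k=2))  # → ['c', 'a']
```

The speed-versus-recall tradeoff Simon alludes to lives in `nprobe`: with `nprobe=1` a vector sitting just across a cluster boundary can be missed entirely, and probing more clusters recovers it at the cost of scanning more vectors.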

Kaivalya Abte [1:08:16]:
Right. Awesome. Well, Simon, thanks a lot for joining me today and sharing all the insights on how things work in turbopuffer. I'll attach links to the blog and the other resources you mentioned so people can refer to them. But as you said, building a search application to solve a real use case is going to be the best use of time, and learning by reasoning with ChatGPT or something like that is going to be really powerful. Awesome.

Simon Eskildsen [1:08:51]:
Absolutely. Absolutely. Thank you so much for having me.

Kaivalya Abte [1:08:55]:
Absolutely. Pleasure.