Query prices reduced by up to 94%

Billion-scale vector search with Notion

May 29, 2025, Data Council Conference

Transcript

Mickey Liu [0:00]:
So my name is Mickey. I'm a software engineer on the data platform team at Notion.

Simon Eskildsen:
My name is Simon. I'm one of the co-founders of turbopuffer, and we've worked very closely with Mickey and the team on this use case.

Mickey Liu:
Great. And today our talk will be about how turbopuffer scales to a billion objects under vector search running on object storage, and specifically we'll talk about how Notion uses turbopuffer as well. So before we begin, for those of you who might not be familiar with Notion, we essentially provide teams with a connected workspace that brings together all of your knowledge and work, and we've built Notion AI on top of Notion to help you find answers quickly via AI search and to automate a lot of your work as well. Forbes even calls us the AI everything app.

To enable Notion AI search specifically, we built out our own RAG system that utilizes text embeddings stored in turbopuffer, which is essentially a search engine built on top of object storage. Using turbopuffer, we're able to provide users with searches and answers that are context-aware. Now before we dive in, let me just give you guys a quick demo. So here, as you can see, we have a very typical Notion workspace, LumiLabs, which has their content structured in the same way that a lot of our customers choose to use Notion. So on the left-hand tab, you can see that information is organized by departments. We have compliance, customer support, engineering, etc. And within every section, there are additional pages and pages within those pages as well.

So this workspace very much leverages the core Notion capabilities of storing your documents in essentially a tree structure. Say that I'm an employee at LumiLabs, and I want to find answers to some of my questions. For example, let's say my laptop broke and I need a new machine. Instead of having to sift through all the content within my Notion workspace, I can simply just ask my question to Notion AI. So, we'll come here and I'll just type out, "My laptop broke. How do I get a new one?" So Notion AI will take a few seconds to think about the response before generating what seems to be a coherent and actually informative response to me.

It says to get a new laptop, you'll need to submit an IT ticket, and here's how, and it even tells me the information I need to include as part of my IT ticket, as well as the SLA that I can expect. And as you can see, Notion also provides citations, which are essentially links to existing Notion documents that contain the information used to surface the response. Here we can see that even though my input query, the raw question itself, doesn't necessarily contain the keywords that appear in the actual page content, Notion AI is still able to find the most relevant information to answer my question. These types of semantic searches are made possible because of text embeddings as well as the vector search engine.

Let's go back. How does Notion AI search actually work? Well, the main components, like I just mentioned, involve using text embeddings, which are essentially numerical representations of text that capture meaning and relationships between texts. So here, as you can see, we have the words cat, dog, and car. The embedding model will be able to detect that cat is more semantically similar to dog compared to cat versus car, despite the fact that cat and car are lexicographically closer to each other. When all this data is put into a search engine, like a vector database, we can essentially issue queries that return the most semantically meaningful results using any sort of similarity algorithms.
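
As a rough illustration of that idea, here's a tiny Python sketch. The three-dimensional vectors are made up purely for the example (real embedding models produce hundreds or thousands of dimensions), but the cosine-similarity comparison works the same way:

```python
from math import sqrt

# Toy 3-dimensional "embeddings" invented for illustration; a real model
# would emit something like a 1536-dimensional vector per chunk of text.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

# "cat" is semantically closer to "dog", even though "cat" and "car"
# are lexicographically closer as strings.
assert cat_dog > cat_car
```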

Now, Notion's RAG implementation is broken down into essentially two sides: the ingestion side as well as the querying side. On the ingestion side, we have essentially two sets of pipelines running, one called the bulk ingestion and one called the live ingestion. The bulk ingestion refers to the initial embeddings ingestion that happens when an existing Notion workspace chooses to enable Notion AI search. As a part of activating AI, we'll essentially take a snapshot of all the content that a Notion workspace contains, along with any other data sources that they want to include as a part of Notion AI search, such as Google Drive and Slack. We'll call our embedding model to batch generate embeddings for all that content and then insert it into the search engine. Along with the embedding vector objects, we also store metadata, which we use when running search queries.

Once the workspace finishes its initial onboarding process, the live ingestion process will pick up changes that are made to the workspace. These changes will come in the form of page additions, modifications to content, as well as deletions. The live ingestion pipelines, which run in near real-time, capture all these changes and then generate the embeddings for the new content and then upsert it into the search engine. Correspondingly, we'll also delete embeddings for deleted content from the actual vector database search engine as well.

On the querying side of things, we take a raw user input, which is essentially the question that I asked earlier as part of my demo. We'll do some processing, like extract some keywords, and then we'll pass the processed string into the embedding model to generate an embedding vector. We'll then take that embedding vector and run vector search against our vector database. As a part of the actual vector search query, we'll also include a set of metadata filters. For the use case of Notion, these metadata filters typically capture filtering logic on permissions because we only want to surface content to the end-user that the end-user has access to.
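
A toy, in-memory sketch of that query path, with invented document IDs and two-dimensional vectors standing in for real embeddings. In the real system the permission filter is sent along with the vector search query to turbopuffer rather than applied client-side:

```python
# Hypothetical in-memory version of "filtered vector search": apply
# permission metadata filters, then rank by similarity to the query vector.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_vector_search(docs, query_vector, user_id, top_k=2):
    # Metadata filter: only surface documents this user can access.
    visible = [d for d in docs if user_id in d["permitted_users"]]
    # Rank the survivors by similarity to the query vector.
    visible.sort(key=lambda d: dot(d["vector"], query_vector), reverse=True)
    return [d["id"] for d in visible[:top_k]]

docs = [
    {"id": "it-handbook", "vector": [0.9, 0.1], "permitted_users": {"mickey"}},
    {"id": "exec-comp",   "vector": [0.9, 0.2], "permitted_users": {"ceo"}},
    {"id": "lunch-menu",  "vector": [0.1, 0.9], "permitted_users": {"mickey", "ceo"}},
]

# "mickey" never sees exec-comp, even though its vector is a close match.
print(filtered_vector_search(docs, query_vector=[1.0, 0.0], user_id="mickey"))
# → ['it-handbook', 'lunch-menu']
```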

For every question, we may issue multiple queries to the vector database across the different data sources such as Notion, Slack, and Google Drive. Then we'll take the cumulative responses and do some processing, like passing them to a re-ranker, before we generate a response that is coherent and in natural language form, similar to what you saw in the demo. In terms of the scale that the Notion RAG system is working with, we currently have millions of workspaces that equate to over 10 billion chunks of text that we need to generate embeddings for. More concretely, as you can imagine, users of Notion do a lot of edits as well as page modifications to the content, which means that we are a very write-heavy system. Right now, we process about 15 billion embeddings on a monthly basis at a peak of 100,000 embeddings per minute.

The reason why this 15 billion number is greater than the overall 10 billion volume is because every page may be updated multiple times per month, which results in multiple embeddings that need to be generated and written into the vector database. On the read side of things, on the query side, we see about 100 QPS at peak. Hence, the challenge for us when finding the right vector database solution was essentially evaluating along these four fronts. The first being performance. We want to find a vector database that could handle our write-heavy workloads, but at the same time be able to provide very low latency queries. Obviously, we also want to be able to do this at scale, right? So as more workspaces enable AI on Notion, we want to be able to continue delivering this experience without breaking Notion's bank.

We also care very much about the overall recall quality, and we use metrics such as mean reciprocal rank and normalized discounted cumulative gain scores to be able to track the recall. Lastly, we want to be able to help drive the product roadmap so that we can continue unlocking new features for our customers with quick turnaround times. Fortunately for us, we discovered Simon and the rest of the turbopuffer team. Over to you, Simon.

Simon Eskildsen [10:21]:
Okay, perfect. Hi everyone. Yeah, I want to nerd out a bit on turbopuffer. The database that we built supports Cursor, use cases like Notion, and a bunch of others. We started working on this in 2023 and have built this basically object storage native search engine. Today we run, I mean, I don't know how many of the vendors are publishing numbers on this, but at a very, very large scale in the hundreds of billions of vectors and process peaks of more than a million vector writes per second, 10 gigabits per second peaks and things like that. So very high scale with this simple architecture here.

So just to root it before we start zooming in on some of these pieces, it's a very simple architecture where the request comes into the turbopuffer binary, and then we just start going through the memory hierarchies of the in-memory cache, the SSD cache, and then finally to object storage. The canonical storage for turbopuffer is object storage, and that's what makes it really special. So let's go into some of the thought that went behind the design of turbopuffer. A couple of years ago, it became evident to me that maybe there was room for a new type of database. I think sometimes databases are created around only one of these three dimensions. But I think if you really want to create a great database, you want to capture all of these three things at once.

The initial use case for turbopuffer is that I spent almost 10 years working on the infrastructure at Shopify, and after that, I was kind of bopping around helping my friends out with various infrastructure challenges. At one of the companies, we wanted to do some semantic search and recommendations, and this was a small company spending like three to four grand a month on their Postgres installation. We built a little recommendation algorithm on top of vectors in 2022, so these were very new at the time, and it worked great. The problem was that, in napkin math, the bill was going to be 30 grand a month. If you're paying three grand a month for your Postgres and now you have to pay thirty thousand dollars a month for just one feature, it sort of torpedoes your product roadmap.

That's something that we paid attention to at that company. We sort of just shelved it and waited for the workload cost to come down, but I couldn't stop thinking about it. The workload here wasn't just recommendations over vectors; I mean, that was more about the data type, but I saw a lot of companies that wanted to connect very large amounts of data to LLMs, right? It's exactly what Mickey and the team are doing. They have all this data stored in Notion, even their third parties, and they want to connect that to LLMs, right, with permissions and performance and so on. That was the new workload that we saw appearing. The new data type being vectors, right? You kind of chop the head off one of these LLMs, and out come these great vectors that are really good semantic representations: things that are similar in the real world are also similar by low distance in vector space.

But these vectors are enormous. If you have a kilobyte of text, say a Notion document, generally you're going to chunk it into paragraphs or something along some fidelity that you choose. Each one of these, say, four paragraphs for a kilobyte of text is going to turn into some 1536-dimensional vector. You multiply this out, and you end up with like 25 kilobytes of data. So now you have a 25x storage amp on the canonical data. That's why we end up in this situation where companies are looking at vectors. They're trying a couple of different things, but they're at such enormous scale where they just can't afford shoving all of this into their existing databases because the storage costs just don't allow for it, right? Coming back to the 3K, 30K example from before.
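
That multiplication checks out as napkin math, assuming float32 components (a common default) and the chunking from the example:

```python
# Back-of-the-napkin check on the ~25x storage amplification.
text_bytes = 1024       # one kilobyte of source text
chunks = 4              # say, four paragraph-sized chunks
dims = 1536             # dimensions per embedding vector
bytes_per_float = 4     # float32 components

vector_bytes = chunks * dims * bytes_per_float
print(vector_bytes)                      # 24576, i.e. ~25 KB of vectors
print(round(vector_bytes / text_bytes))  # ~24x amplification on the raw text
```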

So you've got a new workload connecting LLMs to data, you've got a new data type that's enormous, and then from my perspective, knowing a little bit more on the database side where I've spent most of my career, there seemed to be three things in motion that, if someone really just sat down in the woods and thought about how to build a database in 2025, they might pay attention to. The first thing being that NVMe SSDs, they've been around for a bit, but they were only really GA'd in the clouds in sort of mid-2017. The way you write a database for an NVMe SSD is a little bit different from how you wrote them in the past. You can get phenomenal throughput to an NVMe SSD that's sort of within an arm's length of memory bandwidth depending on the disk, as long as you just issue enough concurrency to that disk, which most databases were not written for, and all of the Unix utilities that you're using don't even come close to utilizing these disks.

So most software today does not take advantage of this, and it means you could use less memory. It's great. The second thing, and this is I think right now probably also in the halls here at Data Council, there's a lot of buzz around building databases on S3 or object storage in general. I think the thing that most people don't realize is that S3 didn't become consistent until re:Invent in 2020. This is remarkably recent. What that means is that if you issue a write to S3 and you read it immediately after, you get read-after-write consistency, right? You know exactly what to expect. That's a really nice property when you're building a database. You can work around its absence with a really fat metadata layer and consensus layer, but if you want to build a database that is fully object storage native with no other dependencies, you really want this property.

The third, and to some degree most important just in how recent it was, this was only released a couple of months ago at re:Invent for S3. GCS and Azure had it before that, but you kind of have to design for the lowest common denominator here, is compare-and-swap. What this means is that in a traditional database built on object storage, you have this big metadata consensus layer to try to contend with multiple writers and achieve consensus across all the files that you're storing in object storage. But with compare-and-swap, you can get a metadata file, do a little bit of massaging in memory, and then put it back while making sure it didn't change in the interim, allowing multiple writers and multiple readers.
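
To make that compare-and-swap loop concrete, here's a toy in-memory object store that simulates the semantics. The class and method names are invented for illustration; real clients express the same thing with conditional-write headers on the put request:

```python
import uuid

class CasConflict(Exception):
    """Raised when the object changed between our read and our write."""

class ObjectStore:
    """Toy stand-in for S3/GCS/Azure with compare-and-swap semantics."""

    def __init__(self):
        self._objects = {}  # key -> (etag, data)

    def get(self, key):
        return self._objects.get(key, (None, None))

    def put_if_match(self, key, data, expected_etag):
        # The write only lands if the object still has the etag we read.
        current_etag, _ = self.get(key)
        if current_etag != expected_etag:
            raise CasConflict(f"{key} changed underneath us")
        self._objects[key] = (uuid.uuid4().hex, data)

store = ObjectStore()
store.put_if_match("ns/metadata", b"v1", expected_etag=None)  # first write

# Writer A: read the metadata, massage it in memory, put it back.
etag, _ = store.get("ns/metadata")
store.put_if_match("ns/metadata", b"v2", expected_etag=etag)

# Writer B still holds the stale etag, so its write is rejected. No
# separate consensus layer is needed to keep multiple writers safe.
try:
    store.put_if_match("ns/metadata", b"v2-conflicting", expected_etag=etag)
except CasConflict:
    print("conflict detected, re-read and retry")
```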

So you've got these three big things: NVMe SSDs that require a lot of queue depth, S3 that's now consistent, and compare-and-swap, and there's no database that's been fully designed to take advantage of all of this. It felt like a great moment in time because there's a new workload that could really benefit from it and a new data type as well. So let's keep diving deeper here. Let's start with the economics. The economics are interesting, right? For a company like Notion, it allows them to search way more data and up their product ambition if they can search more data.

In a traditional database, you know, the relational database designed in the 80s and 90s, generally what we're going to do is put all the data on the SSD on the writer, and then replicate it to two readers, right? We're going to have a copy of all the data on all of them. You can play tricks, but at the end of the day, it's going to be hard to escape the economics of around 10 cents per gigabyte for a disk in a cloud, right? You're not going to run those disks at 100% utilization. You're not going to run them at 10% either, and 90%, if you've ever been on call, is super scary, but 50% is maybe a nice thing to ballpark from.

So you kind of multiply that out: 10 cents per gigabyte at 50% utilization is 20 cents per usable gigabyte, multiplied by three copies, and you end up with an all-in cost of about 60 cents per gigabyte, right? What traditional search systems allow, some of the solutions built on top of Lucene like OpenSearch and Solr and others, is that generally you have a model that looks a lot like a relational database, where you have shards that are replicated to another node. You can live with a little bit more risk here. With a writer and two read replicas, if the writer is completely trashed and you fail over to a reader, you still want to have a replica left; otherwise, it's just too much risk to take on.

But for a search system, generally, you can tolerate a little bit more risk and only have two replicas in total, so this lowers the cost a bit. A lot of the very large-scale OpenSearch and other Lucene-based deployments you will see run with this kind of replication factor, at about 40 cents per gigabyte. So let's bring this back to building a database that's completely object storage native. The cost for something like S3, and I use object storage and S3 interchangeably here because it's just shorter on a slide, is two cents per gigabyte for storage. Now we can put whatever is actively maintained onto an NVMe SSD, and we can fully utilize that disk.

So now we end up with, if everything was in cache, about 12 cents per gigabyte. The blended cost is somewhere in between, right? Not every Notion workspace needs to be active and in cache all the time. There's some Pareto distribution that's idiosyncratic to the customer, right? Maybe Notion has a little bit more activity on their namespaces than, say, Cursor, right? But it's a different blended cost, with a lower bound of two cents per gigabyte and an upper bound closer to 12 cents per gigabyte. So if we can take full advantage of this, we're lowering the cost here by a pretty respectable multiplier, right? If the blend of cost leans more towards the inactive data on S3, we get anywhere from a 5x to more than an order of magnitude reduction in the end-to-end cost here.
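
Putting the numbers from the last few paragraphs into napkin math:

```python
# Replication economics from the talk, as back-of-the-envelope arithmetic.
disk_cost = 0.10          # $/GB/month for a cloud SSD
utilization = 0.50        # running disks half full is the comfortable ballpark
effective = disk_cost / utilization   # $0.20 per usable GB

relational = effective * 3   # writer + two read replicas -> $0.60/GB
lucene     = effective * 2   # search tolerates two copies -> $0.40/GB

s3_cold  = 0.02              # everything parked on object storage
nvme_hot = 0.12              # active data fully cached on NVMe

print(f"relational: ${relational:.2f}/GB, search: ${lucene:.2f}/GB")
print(f"object-storage native: ${s3_cold:.2f} to ${nvme_hot:.2f}/GB blended")
```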

So why is this hard, building a database that takes full advantage of this and has respectable latency when all of the data is on S3? In a traditional database that sort of bolts S3 on later, what happens is that you put the shard on S3, and then when it's cold, you download the entire thing and hydrate it into memory. This is pretty slow; it's going to be difficult to do better than seconds, maybe tens of seconds, of latency, because you're downloading so much data and then using all the memory bandwidth, maybe even disk bandwidth, depending on the implementation.

In a traditional database, you store a pretty significant portion of the data in RAM, right? If you have a two-terabyte disk and you fully utilize that, you might have some percentage of RAM that's sort of idiosyncratic to your workload, right? And that's sort of a nice DBA job to maximize the cost performance there. The nice thing about RAM is that the random read latency is just phenomenal, right? We can just go read a random place in memory, and it's 100 nanoseconds. If it's in the CPU caches, we're talking single-digit to low double-digit nanosecond latency. So this is great. Most storage engines that have been built today, the open-source LSMs and so on, kind of don't really think too much about these round trips.

The engine can just chase metadata: I get this metadata, and now I know I need to get that metadata, and so on. It's not a big deal because it's all in RAM, and when the server boots up, it takes a bit of time. But for something like S3, each one of these round trips takes 200 milliseconds, right? It's a big difference between 100 nanoseconds and 200 milliseconds. So we really have to design a storage engine that completely optimizes for minimizing the number of round trips, because we can get a lot of data in 200 milliseconds, but it takes about 200 milliseconds to get it, depending on the size of the data that we're getting.
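
The gap is easiest to see as arithmetic, using the rough latency figures from the talk for the same dependent pointer-chasing access pattern at each tier:

```python
# Why dependent round trips dominate on object storage: the same
# "chase three pointers" pattern at each tier of the memory hierarchy.
RAM_LATENCY  = 100e-9    # ~100 nanoseconds per random read
NVME_LATENCY = 100e-6    # ~100 microseconds per batch of issued reads
S3_LATENCY   = 200e-3    # ~200 milliseconds per request

for name, latency in [("RAM", RAM_LATENCY), ("NVMe", NVME_LATENCY), ("S3", S3_LATENCY)]:
    print(f"{name:>4}: 3 dependent round trips = {3 * latency * 1000:.4f} ms")

# In RAM, chasing metadata pointers is essentially free; against S3,
# every extra dependent round trip costs another ~200 ms, so the storage
# engine has to fetch big batches in as few trips as possible.
```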

It's the same with a disk, right? On a disk, especially an NVMe SSD, you can get an enormous amount of data, but you just have to wait about 100 microseconds to get all those blocks that you issue at once. So you end up in a place where you're looking at more data in each round trip, but you're minimizing the number of round trips. Suddenly, you can get away with using a lot less memory and using object storage a lot. So we have to design an LSM tree and a storage engine in general that's completely optimized around all of that. If you build a database with all of that in mind, you, of course, end up with strengths and limitations, right? And we have to make sure that these are strengths and limitations that make sense for a search engine.

The strengths are that the costs are great, right? If you have this cold-hot mix, you can get closer to the lower end of that blended cost, towards the two cents per gigabyte. Even if all the data is being used all the time, it's still much cheaper than having everything on three disks, right? Because we just back it with object storage, we have all the durability of one of the most durable systems on earth. It's simple. We have the S3 team, the GCS team, and the Azure Blob Storage team working for us here, doing a lot of the hard work of doing this at very large scale.

I was on the Shopify last-resort pager for almost a decade, and for better or worse, that changes the way you write software forever. Having simple stateless nodes that you can churn away at any point and upgrade, acting as a cache in front of object storage with all the durability sourced there, is a wonderful design. It takes a lot of discipline in the round trips, as we discussed before, but you end up with a very reliable and horizontally scalable service, and simplicity is, in my opinion, an absolute prerequisite for scale if you're trying to scale fast. The warm queries, there's no reason why they can't be as fast as something that's in memory all the time, right? You just keep hitting that node, and it will sort of inflate the pufferfish into the memory hierarchies, and everything will be in memory, right?

The write throughput is phenomenal because we don't have a metadata store or anything like that in front of it. So we just make use of the automatic partitioning of S3, GCS, and everyone else that will keep splitting the key space so we can get and serve these millions of vectors per second that are written into turbopuffer. But of course, there are limitations, right? We have to write this storage engine from scratch, which takes time because not every feature under the sun is available immediately. But fundamentally, the limitations are that once in a while, you will see a cold query miss, right? You will go to object storage, and we will do those range reads, and that will take in the hundreds of milliseconds, right?

We will try to do around three round trips where I get the metadata, get the index blocks, get the data blocks, and then maybe we have to get more data blocks depending on the query plan. So that can be slower, but it's not super slow; it's like 500 to 800 milliseconds depending on the query plan. It could be more if the index is really large, and you can work around this, right? In the blended cost, we can just go closer to the upper end of that boundary and just store more of this data and make sure it's always hydrated in cache. Another fundamental limitation is that there's higher write latency, right? We consider committing to object storage to be the new fsync, so you have to wait a couple of hundred milliseconds for every write to be committed.

If you're doing an OLTP-type workload where you're updating a lot of rows all the time and doing transactions, this kind of write latency is not really acceptable. But for a search engine, it's great. Typically, you're pumping updates into some kind of data pipeline, taking them out the other end, and then writing them in. Adding a couple hundred milliseconds of latency is not really a big deal, especially not when we can do this enormous write throughput in the tens or hundreds of gigabytes per second. So if we mold all this together, we end up with a very simple architecture and one that you might actually want to be on call for. If we kind of follow the query through the path here, the client will issue a request to the turbopuffer API, and it will go to the load balancer.

You will do that query to a namespace, which, depending on the parlance that you're used to, might be analogous to a table or a tenant, or for us, it's really just an S3 prefix. But you can have millions of these. We have 50 million of these in production. It's not a big deal for us to have a lot of them. So we consistently hash that namespace to one of the query nodes. That's where that namespace is calling home at this point in time and where the cache is most likely to be warm. If the cache is warm, we'll just serve the query out of memory and SSD. If it's cold, we'll go directly to S3, do the range reads, and minimize the round trips to try to serve that query as fast as possible.

The write path will go through and consistently hash to the same node. It will write into the write-through cache to make sure that the subsequent queries that come in don't have to do a cold query onto object storage. The write will flow directly through to S3. If you can imagine sort of like a simplified view, when you write into a namespace, there's a directory on S3 called whatever your namespace is called, and there might be a directory called the write-ahead log, the WAL, and then the index. The write-ahead log, you can imagine every time you do a write, it's like there's a file called 1, 2, 3, 4, 5 for every write that you do, and then the index sort of has to catch up to the data that you're writing, merge it into that vector index, the filtering indexes, and all these other indexes that we build to serve your queries.

It's constantly going to be a little bit behind, and then we can apply the WAL on top like a traditional database. Once the WAL has progressed far enough past the index that was last maintained or updated, the query node will add that namespace to the indexing queue, which is also a queuing system built directly on top of object storage, and the indexing nodes, separating the indexing throughput from the query throughput, will start to pull off of there. This node pool is completely autoscaled, so sometimes we see huge re-ingestions from our customers updating embedding models or others. We can scale these pools to hundreds of thousands of nodes and then scale them back down to a steady state that might be a low multiple of tens.
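
As a simplified sketch of that WAL-plus-index layout, with an in-memory dict standing in for S3 and invented key names. The index lags behind the numbered WAL files, and a consistent query replays the WAL tail on top of the last index:

```python
objects = {}          # stands in for S3: key -> value under the namespace prefix

def write(namespace, seq, docs):
    # Every write lands as a numbered file in the WAL directory.
    objects[f"{namespace}/wal/{seq:05d}"] = docs

def build_index(namespace, up_to_seq):
    # The indexing node catches the index up to some WAL sequence number.
    merged = {}
    for seq in range(1, up_to_seq + 1):
        merged.update(objects[f"{namespace}/wal/{seq:05d}"])
    objects[f"{namespace}/index"] = {"docs": merged, "applied_seq": up_to_seq}

def query(namespace, latest_seq):
    # Read the index, then apply the WAL entries it hasn't absorbed yet,
    # like a traditional database, so reads are fully consistent.
    index = objects[f"{namespace}/index"]
    docs = dict(index["docs"])
    for seq in range(index["applied_seq"] + 1, latest_seq + 1):
        docs.update(objects[f"{namespace}/wal/{seq:05d}"])
    return docs

write("lumilabs", 1, {"page-a": "v1"})
write("lumilabs", 2, {"page-b": "v1"})
build_index("lumilabs", up_to_seq=2)   # indexing node catches up
write("lumilabs", 3, {"page-a": "v2"}) # newer write, not yet indexed

print(query("lumilabs", latest_seq=3)) # sees page-a at v2: fully consistent
```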

This is a really nice separation of workloads. If you're used to doing search, you may have had to scale up a cluster, but because of the shard rebalancing, it's so slow that you just don't end up doing it. I talked a bit about this round-trip sensitivity earlier, and of course, we have to design the LSM tree to be very sensitive to the number of round trips and try to get as much data in each round trip as possible. On the left here, you could imagine that you have a bunch of these vectors, right? We simplify them down here to just being two-dimensional. In reality, there are hundreds or thousands of dimensions, and we plot all of these in a coordinate system.

We can imagine here that letters that are adjacent represent something similar. Maybe E and H represent red dress and burgundy skirt, right? These are things that are semantically grouped together. So if we want to do a vector search query here, you know, finding something that's related to one of these things with a query vector, we can exhaustively search through every single vector and then return the top however many results back. The problem with that is that once you start getting into the millions and millions of vectors, it starts to get quite slow. If you have just a couple of thousand, tens of thousands, or hundreds of thousands, I constantly talk to prospective customers and just tell them to keep exhaustively searching in memory, because it's very simple and it's very easy to maintain.

But as you get into the millions, you want to start thinking about building an approximate nearest neighbor index. An ANN index will not be 100% accurate. Going through everything is 100% accurate, but there's no known way to be 100% accurate while also cutting down a bunch of query time by trading it for write time. There are two solutions that we see at scale. Broad categories, there's a myriad of children within each one of these forks in the branch here, but the graph index is one that's very popular for workloads that need to do an enormous amount of QPS and where you can afford to store all the data in memory.

In a graph, we start to build up basically a graph where points that are adjacent to other points in vector space are also connected in the graph. You may have heard of things like HNSW and DiskANN, and they're all sort of variants of this, making various different trade-offs. This isn't HNSW (HNSW is much more sophisticated than this), but it illustrates the general concept of doing a connected graph over the vector space. To do a query on a graph index, let's say we're searching for something that's close to N, we go in and we compare our vector to A, and then we might find out B is a little bit closer, and F is a little bit closer, and we start navigating the graph.

The problem with this is that if this is a cold query, that's 200 milliseconds, then 200 milliseconds, then 200 milliseconds again, and even on a small graph like this, it could be six round trips, so now you're above a second for even something very small. In reality, you're going to start connecting more edges and do all these different things, but there are limitations, and this creates a lot of write amplification when you're building these indexes. So the graph index is really problematic not just for object storage but also for disks, because a search that fetches only a small amount of data per round trip is not a very efficient workload on disk either.

So turbopuffer uses a clustered index, and at a very high level, again for simplification here, what you do with a clustered index is that you start to divvy all the data into semantically grouped clusters. Maybe cluster number one here is dresses, and cluster number two is pants, and cluster number three is shoes, right? And within that, you have different things that fit that category. You run something like k-means to make all these clusters; the centroid is the average vector of an entire cluster. At query time, all you have to do is compare your query vector to the different centroids. Now you can do it in two round trips. You download all the centroids, so you pick that file off of S3, look at all the centroids, find the closest N clusters to your query vector, download those clusters, and compare.

So now we're downloading more data, we're maybe looking at more vectors, but we're really leveraging that high throughput that we can get in each round trip from disk and from object storage. In the end, we get pretty good performance. If we look at the cold case first, where we're going directly to object storage with about a million or so vectors in the namespace, we try to keep this to about three to four round trips depending on the query plan. These are completely cold queries with nothing cached. Of course, as soon as you do one query, we start to cache metadata and things like that to cut down the number of round trips.
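
Here's a toy version of that clustered-index query in Python. The clusters, letters, and two-dimensional vectors are invented for the example; in reality the clusters come from running something like k-means over millions of high-dimensional vectors:

```python
# Toy clustered (IVF-style) index mirroring the two-round-trip query:
# fetch the centroids, pick the nearest cluster(s), then exhaustively
# compare only inside those clusters.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Pretend these clusters already came out of k-means: dresses, pants, shoes.
clusters = {
    "dresses": {"E": [1.0, 1.0], "H": [1.2, 0.9]},
    "pants":   {"B": [5.0, 1.0], "C": [5.2, 1.1]},
    "shoes":   {"K": [1.0, 5.0], "M": [0.8, 5.2]},
}
centroids = {name: centroid(list(c.values())) for name, c in clusters.items()}

def query(vec, nprobe=1, top_k=1):
    # Round trip 1: download the centroids file, rank clusters by distance.
    nearest = sorted(centroids, key=lambda n: sq_dist(centroids[n], vec))[:nprobe]
    # Round trip 2: download those clusters, exhaustively compare inside.
    candidates = [(cid, v) for n in nearest for cid, v in clusters[n].items()]
    candidates.sort(key=lambda kv: sq_dist(kv[1], vec))
    return [cid for cid, _ in candidates[:top_k]]

print(query([1.2, 0.9]))  # → ['H'], found inside the "dresses" cluster
```

Raising `nprobe` trades extra downloaded clusters for better recall, which is the same knob the exhaustive-versus-approximate discussion above is really about.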

On a warm query, we can be as fast as any other solution. Generally, we find that for a search workload, anything in the tens of milliseconds is quite acceptable, and actually, the majority of this time is spent going to object storage to figure out what the latest write is and then making sure that we're applying it to the query for fully consistent reads. In turbopuffer, if you turn off consistent reads and we don't have to do that metadata round trip to object storage, all these warm latencies go down by an order of magnitude. If you look at our production traces, it's basically a span this long going to object storage to figure out what the latest file is, and this much time actually doing the vector search query, the full-text search query, or the aggregation or whatever operation you're applying on turbopuffer.

Mickey Liu [33:03]:
Cool. Thanks, Simon. So as of late last year, Notion has been running our whole production-scale embeddings workload on turbopuffer. We are on track to save a few million dollars annually on our vector database, which is an amazing win for us. We've also managed to work with Simon and the turbopuffer team on reducing our query latencies, specifically around larger namespaces, and currently, we observe peaks of about 50 to 70 milliseconds across all queries within Notion. In terms of what the future looks like between Notion and turbopuffer, we expect to continue working with Simon and his team to bring new features such as full-text search to enhance vector search results, as well as adding additional data sources that you can plug into your Notion AI in addition to Google Drive and Slack.

So I want to obviously thank everybody for listening to our talk today. We'll be here to answer any questions that you might have, and also feel free to catch us later at the office hours in the afternoon. Thank you.