Jason Liu [0:04]:
So, you know, I think one thing that people always talk about is doing vector
search, right, doing vector search at scale. And when I think of, you know,
databases that are truly being used in production, I mostly think of
turbopuffer, right? They're the tool that Cursor, Linear, and Notion all use in
their backends to do search. And, you know, I think in conversations with those
folks, it's always been the case that whatever provider they were using before
was very expensive, you know, very slow. I think in the past year or two, people
have realized that turbopuffer has been a great solution, especially because
it's backed on object storage, right? It's a tool that gives you full-text
search, vector search, and as we talked about earlier, you know, filtering and
aggregations. And these are all the ingredients you need to do search really
well for the context of RAG, but also I think in the future a lot is going to
happen in the context engineering realm with things like facets and aggregates
allowing you to give context to the language model to make more search queries
in the future. And today's guest, Simon, is the CEO of turbopuffer, previously
leading a pretty big engineering team at Shopify. And we're going to talk about
some of the design choices that he's made, right? How do you think about
billion-scale search, you know, bringing some real case studies from the
companies we mentioned before and how we think about things like tuning recall
and latency. And so as always, if you have any questions, please ask them in the Slido link that we shared in the Zoom chat, and upvote the questions you want to see answered. And with that, Simon, take it away.
Simon Eskildsen [1:31]:
Awesome. Well, hello, everyone. So when I proposed this talk, I called it "billion-scale vector search on object storage." That's slightly out of date now: turbopuffer is at trillion scale, and not just for vector search but for search in general, on object storage. Today, I want to talk a bit about the
founding story of the company, like why did I decide to build the first version
of this and then bring on my co-founder and the rest of the team later on to
productionize this a couple of years ago. We'll talk about how people use it and
why they've decided to switch over from other solutions and other storage
architectures to this more novel storage architecture. We'll talk about the storage architecture that makes turbopuffer special, the first completely object-storage-native database, and why this type of architecture is so suitable for this era of search needs. turbopuffer at a
glance is a search engine on object storage. It can do semantic search, so
classic vector similarity, and we can do this on trillions of vectors that we
have in production today. It can do full-text search, so it's not limited to
just semantic similarity, but can also do traditional BM25 search, and you can
combine those two types to do hybrid search, which many of our customers today
are doing. But turbopuffer doesn't just do semantic search and full-text search; it can also do aggregations and group-bys, for the facets that Jason and I were just talking about earlier. What makes turbopuffer really special is that it has a
new storage architecture. And in this storage architecture, all of the data is
by default stored on S3 or GCS or Azure Blob Storage, one of those three, and
then all of the caching is put in front of it. So it works a little bit like a JIT compiler: the more you query it, the further the data moves up the cache hierarchy and the faster it gets. That's literally the reason for the name: the pufferfish inflates, right? All the way from deflated in object
storage where queries take maybe around 500 milliseconds into disk where they
take tens of milliseconds and into RAM where they can take less than 10
milliseconds. So that is the general architecture of turbopuffer that makes it
special and we'll go into that in a second. So who's using turbopuffer? We have
lots of customers that we've worked with very, very closely to productionize and
make them successful in production. Cursor was the first customer of
turbopuffer. We've been working with them since they were a small team in 2023
and grown alongside them. And we feel we've grown really tremendously with them.
Notion, Linear, and others also work with us. Superhuman recently went to
production as well. We work with one of the top AI labs and many other
customers, hopefully some of which you know and use. There are many other logos; you don't always get the rights to show them right away, but this list will continue to grow. So I want to talk a little bit about why you build a new database, because
as many of you are likely aware, there have been search engines since the '90s,
right? You've had Sphinx, you've had Elasticsearch, you've had Lucene. There's
lots of different search engines that are all great for particular types of
workloads. I fundamentally believe that if you are building a new database and
you're going to make it, then you need two key ingredients. The first one is
that you need a new workload. You need a reason for people to adopt a new
database in their stack because it's a serious commitment to adopt a new
database. It's a very sticky product, it's difficult to migrate, and the feature
surface area varies a lot across the different vendors.
So you need a new workload. And the new workload today is to connect LLMs to
enormous amounts of data, right? And in some instances also users to a lot of
data. But connecting LLMs to new data is really the newest workload, right? Like
in general we talk about RAG and that's literally what RAG is. One thing that's
also very interesting about this is that the new workload in particular is very
large, right? If you have a kilobyte of text, it easily turns, after chunking, into four vectors of, say, 1024 dimensions, which means that you now have 16 kB of vectors from 1 kB of text. So it's really, you know, inflating
the size of the data. We call this storage amplification or size amplification in database parlance, and some people even use vectors with many more dimensions than these in some cases to get more recall, more precision.
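To make the arithmetic concrete, here is a small sketch of the storage amplification Simon describes. Assumptions: float32 embeddings (4 bytes per dimension); the chunk count and dimensionality are the illustrative numbers from the talk.

```python
# Storage amplification: 1 kB of text chunked into 4 embeddings
# of 1024 float32 dimensions each (numbers from the talk).
text_bytes = 1024              # 1 kB of raw text
chunks = 4                     # chunks after splitting
dims = 1024                    # embedding dimensions
bytes_per_float = 4            # float32

vector_bytes = chunks * dims * bytes_per_float
amplification = vector_bytes / text_bytes

print(vector_bytes)     # 16384 bytes, i.e. 16 kB of vectors
print(amplification)    # 16x storage amplification
```

Higher-dimensional embeddings or more chunks per document only push the amplification factor further up.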
These are enormous. One company that I worked with was spending $3,000 a month on their Postgres instance; they wanted to vectorize all of their unstructured text and put it into a vector index in another database, and it would have cost them $30,000 a month. That's prohibitive for their use case, and it has been prohibitive for a lot of other use cases around the world, because when you take on this workload, you want to earn a return on the product you ship without the economics becoming the limiting factor. So the second
thing is that you don't just need a new workload, you also need a new storage architecture, because if you just have a new workload, you can tack it onto an existing database, right? The reason this is not a great fit for something like Postgres and others is simply the enormous amount of data it takes to store on a traditional storage architecture, which we'll get into in a second. Some workloads, like vectors, need a new storage architecture to become economical so that companies can earn a return on them. But some do not. For example, when
geosearch really blew up with mobile in the early 2010s, that didn't need a new
database because coordinates are tiny, right? And they fit fine with all the
existing databases. But vectors in particular require a new type of index and a
10x cheaper storage architecture can really unlock new use cases.
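Simon's geosearch comparison comes down to per-record index size; a quick sketch (the byte sizes are illustrative assumptions: float64 coordinates, a float32 embedding):

```python
# Why geosearch didn't need a new database but vectors do:
# compare the per-record payload each index has to store.
coord_bytes = 2 * 8        # lat/lon as two float64s: 16 bytes
vector_bytes = 1024 * 4    # one 1024-dim float32 vector: 4096 bytes

print(vector_bytes // coord_bytes)  # a vector is 256x larger per record
```

Coordinates fit comfortably into existing databases; vectors inflate every record by orders of magnitude, which is why a cheaper storage architecture changes what is feasible.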
Simon Eskildsen [7:35]:
The storage architecture that we can have today, which we couldn't have 10 years ago and which no other database was designed to take advantage of the way turbopuffer was when it launched, builds on three ingredients. The first is NVMe SSDs. These SSDs are incredibly fast. They are only about
four or five times slower than accessing memory if you use them
correctly and you drive I/O for a lot of throughput, but they're about 100x
cheaper than DRAM. So if a database really can take advantage of NVMe SSDs
instead of DRAM, well then you can drive really good economics. The second thing
that you needed for the new storage architecture that turbopuffer has is that
you need S3 to be strongly consistent. What this means is that if you put an
object on object storage, then you should be able to immediately read it back
and you know it's the same object, which you can probably guess is a very nice
primitive if you're building a database to avoid having a whole other
coordination layer. And then the third thing you need is for S3 to have compare
and swap, which they only launched in December of last year. What that means is that you can get an object, mutate it, write it back, and be guaranteed that it wasn't modified in the interim. This allows you to build a database that is
completely object storage native and has the economics that we'll go into in a
second. So the architecture that you end up with is something fairly simple. You have a client, and it connects to a binary. That binary is the database, and it accesses a tiered cache: RAM when the pufferfish is fully inflated, then SSD, and finally object storage. That is the overall very simple architecture turbopuffer has had since day one. Let's talk about the economics, because I said you need those two raw ingredients for there to be room for a new database business. RAM is expensive: about $5 per gigabyte, give or take some error bars. Databases don't run entirely on
RAM unless it's purely a cache, because, of course, if you shut down the machine, the data is gone. So generally, you store it on three SSDs, three copies, because
even if one machine goes away, you still have two other copies. That costs about
$0.60 per gigabyte of data stored. This is how most traditional databases like
Postgres and others will store data. But if you store the data just in S3, well,
it's $0.02 per gigabyte, right? It's like 20 times cheaper than storing it on
three SSDs yourself. And if you can then take the data that's in S3 and cache it
with an SSD cache in front, you're still a lot better off. And maybe not all of your data is even active, so some percentage is always in S3, some is on SSD, and some is in RAM, and you're paying exactly for the performance characteristics you need as the pufferfish inflates on the subsets of the data sets you use the most. So these are improvements
that are an order of magnitude better on real production workloads. And we see
some of our customers migrate and cut their first bill from their last bill with
their past provider by up to 95% by realizing this architecture with us simply
because their workload is a really good fit for it. We happen to think most workloads are. The thing you need to do, and I alluded to this earlier, to build a database that takes advantage of all this is to make it object storage first and also round-trip sensitive. What this means
is that when you go to, for example, S3, the p99 on accessing an object on S3 is
maybe 200 to 300 milliseconds, depending on which object storage you use. But in
those 200 to 300 milliseconds, you can max out the network. You can get an
enormous amount of data back, but every single round-trip takes around that
long. This is very similar to how an NVMe SSD works; it's just a lot faster than
200 milliseconds. Every round trip is around 100 microseconds, but you can drive
a lot of throughput in every single one of them. You can't avoid the round trips, but in RAM it doesn't matter as much: random reads in RAM are very, very fast, and sequential is faster, of course. But if you build a round-trip-sensitive database, it will work very well with S3: go out, get some data, then a little bit more, and then serve the query, rather than doing lots of slow round trips that each return very little. This works really well on SSD as well. So if we want to build this object-storage-first, round-trip-sensitive database for vector storage, we have to be very careful about how we lay out the vectors. If you search raw vectors with an exhaustive
search, it becomes very slow because if you have a billion vectors, you have to
search terabytes and terabytes of data sequentially, whether from object storage, disk, or RAM, which is very expensive. Even if all of it is in RAM, for a million vectors it will take you hundreds of milliseconds. You can
maybe squeeze it down, but it becomes very difficult to do economically while serving a lot of queries, and it becomes extremely prohibitive for cold queries on object storage. A graph index is a very traditional way of doing
vector search. It became very popular sort of as the first wave of vector
databases arose. But a graph index is not suitable for an object-storage-first, round-trip-sensitive database that needs to work well on disk and on object storage. The reason is that when you navigate a graph,
you sort of get dropped in the middle of the graph, and then you navigate the
graph by edges. Every single time you do that, it's 200 milliseconds at the
start, then 200 milliseconds to get to the first layer, 200 milliseconds to get
to another layer, and so on as you navigate the graph. That's very good in
memory because you're not reading that much data and the latency is very low,
but for disks and for object storage where you're much more sensitive to the
number of round trips versus the amount of data per round trip, it's not a good
architecture. Then there are clustered indexes, which were actually the first wave of vector indexes, even before vector databases became popular a few years ago. What you do in a clustered index is try to do a
natural grouping, right? You have a clothes cluster, a food cluster, whatever
the semantic grouping is, and then you put those adjacent on disk. So if you
think about it very simply, you can think about it as on S3 there's a file
called 1.txt, 2.txt, 3.txt, and 4.txt. Then for every one of those clusters, we
take the average of all of the vectors in the cluster and we create another file
called centroids.txt. Now we only have to do two round trips to serve the query. We get all of the centroids from centroids.txt, find the closest
say two or three, and then we download just those files, right? So imagine that
at a much larger scale: we max out the NIC to get all the centroids, then max it out again to get the clusters, but only do two round
trips. So we can look at a lot more data with this kind of architecture, and it
works very well for disk as well. And it happens to also work great in memory.
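The clustered layout Simon describes can be sketched in a few lines. This is an illustrative IVF-style model, not turbopuffer's code: a dict stands in for S3 objects, the file names mirror his 1.txt/centroids.txt example, and the query does exactly two "round trips" (one for centroids, one for the chosen clusters).

```python
import math

# Minimal sketch of a clustered vector index laid out as object-storage
# files. The `store` dict stands in for S3; keys are object names.
store = {"centroids.txt": []}  # one (file, average vector) per cluster

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build(store, clusters):
    # clusters: list of lists of vectors; one file per cluster, plus
    # a centroids file holding each cluster's average vector.
    for i, vecs in enumerate(clusters, start=1):
        store[f"{i}.txt"] = vecs
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        store["centroids.txt"].append((f"{i}.txt", centroid))

def query(store, q, nprobe=2, k=1):
    # Round trip 1: fetch all centroids, pick the nprobe closest clusters.
    nearest = sorted(store["centroids.txt"], key=lambda kv: dist(q, kv[1]))[:nprobe]
    # Round trip 2: fetch just those cluster files and scan them.
    candidates = [v for key, _ in nearest for v in store[key]]
    return sorted(candidates, key=lambda v: dist(q, v))[:k]

build(store, [[[0.0, 0.0], [0.1, 0.0]], [[5.0, 5.0], [5.1, 5.0]]])
print(query(store, [0.02, 0.0]))  # -> [[0.0, 0.0]]
```

Each round trip can be as large as the network allows, which is exactly the shape of access that object storage and SSDs reward.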
Um, okay, there's some chat here. I'll answer those afterwards. Of course, every
database comes with trade-offs, and we try to be very transparent in what the
trade-offs are because I spent the majority of my career on your side looking at
these databases all the time. And the first question is always, what are the
trade-offs? What are the limits? And what is it suited for? What does it do and
what does it cost? The trade-off for an object storage first database is that
cold queries can be slow. Once in a while, you'll hit a server that doesn't have
the data in cache and you have to go to object storage. And no matter how much
we optimize it, it's still going to be hundreds of milliseconds, maybe half a
second for that first query while we then start hydrating the cache, inflating
the pufferfish for that particular set of data. You can mitigate that and still come out cheaper: because you have a very cheap canonical source of truth at $0.02 per gigabyte, you could keep everything in cache and it would still be much cheaper than any other storage architecture. But once in a while, you will
have that cold query. The other limitation is that you will have high write
latency, right? Every time you write to turbopuffer, you write directly to object storage, and you can't get around that. That's going to be 100 to 200
milliseconds of latency. And there's some economics with doing very small writes
that are a little bit unfavorable to some workloads, but it's not something we
see very much. Those are the fundamental limitations. They trickle into some
other limitations like certain types of transactions being difficult, but really
these are the only fundamental limitations with this architecture. It means that
doing something like a very high transactional workload like Shopify, where I
worked before, doing some transactional workload like that would not be suitable
to do a checkout system on a database like that. But it is extremely suitable
for indexing lots of data to search, or letting an LLM search, because it's low cost and very simple. We're not building that storage layer ourselves; the hundreds or thousands of people who work on S3 are working on it for us, and the same goes for GCS on the GCP side. These are some of the most reliable, horizontally
scalable, and durable systems on the planet, so we can focus on the indexing and
the database itself. Warm queries can be just as fast as an in-memory database once the data is in cache, and we can get extremely high write throughput and give this
serverless experience where people don't have to think about node types, how
many servers are running, things like that, because that's really an S3 problem
that was solved a very, very long time ago. And this gives us this advantage
where we see write peaks of 10 million plus vectors written per second, and it
works great because we can scale as far as S3 can scale, which is, I mean, we
haven't found the limits yet. The architecture you end up building around this is also just as simple as you can imagine. Every time a query comes in, it gets routed with a consistent hash to the query node responsible for that subset of the data. If the query is cold, we go to object storage, take a couple of round trips to get the data, and serve it directly to the user. It might be noticeable, but it's not
going to be so slow that you're sitting in front of the search box for a long
time. It will probably still be faster than a lot of the searches you find on many websites. As the cache hydrates very quickly, at a gigabyte or more per second,
then the queries get much, much faster. For some of our customers like Notion,
when you open the Q&A dialogue to work with your data, they will send a request
to turbopuffer to start hydrating the cache for that particular namespace so
that the subsequent queries are fast. Lots of companies can have this kind of
pattern where they have a hint that you're about to access the data and then the
cold latency goes down even further. On a write, you go to a query node as well,
the same query node, and then we just write it directly into the cache to
increase the probability that the new data is also in cache so that subsequent
queries are faster. One of the things that's unusual about turbopuffer's design
is that all the reads are strongly consistent. So once you've made a write to
the database, it's immediately available to the next query, with strong consistency guarantees, the same kinds of guarantees that S3 operates with.
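The two S3 primitives this consistency story rests on, strongly consistent reads and compare-and-swap, can be sketched as a toy in-memory model. This is an assumption-laden illustration of the semantics, not an S3 client; the class and method names are hypothetical.

```python
# Toy model of strongly consistent reads plus compare-and-swap
# (conditional put), the two primitives discussed above.
class ObjectStore:
    def __init__(self):
        self._objects = {}   # key -> (etag, value)
        self._next_etag = 0

    def put(self, key, value):
        self._next_etag += 1
        self._objects[key] = (self._next_etag, value)
        return self._next_etag

    def get(self, key):
        # Strongly consistent: always returns the latest put.
        return self._objects.get(key)

    def put_if_match(self, key, value, expected_etag):
        # Compare-and-swap: only write if nobody changed the object.
        current = self._objects.get(key)
        if current is not None and current[0] != expected_etag:
            return None  # precondition failed; caller must retry
        return self.put(key, value)

store = ObjectStore()
etag = store.put("wal/0.txt", "first write")
assert store.get("wal/0.txt")[1] == "first write"           # read-your-write
assert store.put_if_match("wal/0.txt", "x", 999) is None    # stale etag loses
assert store.put_if_match("wal/0.txt", "next", etag) is not None
```

With these two guarantees from the object store itself, the database needs no separate coordination layer for its metadata.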
We think that this makes systems more predictable and easy to reason about. You
can turn this off for better performance, but strong consistency is the default. When
you write, we write into the namespace directory on S3. You can think about a
namespace as essentially a prefix on S3 or a table or something along those
lines, and we write into the write-ahead log. The write-ahead log is essentially
just, you know, 0.txt, 1.txt, 2.txt for every write that you're doing, and we do
a bunch of batching to save on costs. Once we've written enough data to update
the indexes, background indexers will compact the write-ahead log and update the full-text index, the vector index, the attribute indexes, and all the various indexes that we build for the data, to keep it fresh. The query nodes then pick up the new indexes as they are built and page those new keys into cache as they get queried. The performance of this ends up being
really nice, right? When the pufferfish is fully deflated, we still get really
good cold latency: the P50 can get down to almost 200 milliseconds for full-text search workloads, and even the P99 is around 500 or 600 milliseconds. This depends a lot on S3 and all of its caching and things like
that. But we do a lot of things in the background to do hedging and so on to try
to keep this latency down as much as possible. When it's in cache, it's as fast
as many other systems. The majority of the time here, and of the variability, comes from the fact that for every single query we go to S3 to make sure we have the latest data, right? Again, turbopuffer doesn't have any other metadata or
anything like that. So we have to go to S3 or GCS to get the latest commit to
make sure we serve consistent queries. If you turn this off, all this latency
goes down, and I think even the P99 is probably less than 10 milliseconds if
you're okay with eventual consistency. I think we can talk about this if anyone
gets into it. Let's cover some of the case studies here and then we'll turn it
over to Q&A in about five minutes. So Cursor is one of our use cases, and I'll
talk a little bit about how they use turbopuffer. So with their previous
solution, they were using an in-memory vector database, which is really the
first generation. Always-in-memory makes a lot of sense for something like Shopify, where the entire catalog is being queried all the time: you might as well have it in memory. The economics of turbopuffer are still a lot better, but you might still be able to earn a return with it in memory on a more traditional storage architecture. But for something like Cursor, not every code base is
active all the time, right? At any point in time some percentage of the code bases are active, maybe 1%, maybe 10%, I don't know the real number. Those can be in memory or on SSD, and the rest can sit in S3 or GCS or wherever the code base is stored. Every single code base in Cursor is just a prefix in S3, and there can be tens of millions of these at any point in time before they get GC'd out. So this pufferfish architecture really lends itself
very, very well for a Cursor code base where as soon as you open a code base, we
can start hydrating the cache for the namespace, and then all of the RAG that
Cursor does will become faster. If you use Cursor's agents today, you will see that it often does semantic queries like, "where in the code base does this happen?" That's turbopuffer behind the scenes: Cursor keeps the index up to date and uses embeddings and re-ranking and so on to draw the right context in. So this helps Cursor both optimize their
inference by finding the relevant context and as little of it as possible, but
also find things that can be very difficult to find with grep or even letting
the agent grep around. So in my experience using Cursor's agent, it's very good
at these kinds of tasks where an agent might need a lot of attempts to grep
because it can find it right away with the semantic index. Cursor has their own embedding model, and they operate at very serious scale. When they moved to us and moved to
our storage architecture from the previous provider, they cut their cost by 95%.
One thing that excites me more than cutting the cost for our customers is for
them to realize the most ambitious version of their product. And in Cursor's
case, this allowed them to index much, much larger repositories than was
economical for them before. The other thing was that before on the traditional
storage architecture, they had to be very careful about which servers had what
code bases and do all this bin packing. With turbopuffer, they don't have to do
that because it's horizontally scalable into as many namespaces as you want.
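The per-namespace routing that makes this bin-packing-free design work (queries routed by consistent hash, as Simon mentioned earlier) can be sketched like this. The node names, virtual-node count, and ring details are illustrative assumptions, not turbopuffer's actual scheme.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    # Stable hash for placing namespaces and nodes on the ring.
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each node gets vnodes points on the ring for smoother balance.
        self._points = sorted(
            (_h(f"{n}-{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def node_for(self, namespace: str) -> str:
        # Walk clockwise to the first point at or after the namespace's hash.
        i = bisect.bisect(self._keys, _h(namespace)) % len(self._points)
        return self._points[i][1]

ring = Ring(["query-node-a", "query-node-b", "query-node-c"])
# The same namespace always routes to the same node, keeping its cache warm.
assert ring.node_for("codebase-123") == ring.node_for("codebase-123")
```

Because every namespace deterministically maps to a node, adding nodes moves only a fraction of namespaces, and nobody has to hand-place code bases on servers.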
Notion is another customer of turbopuffer, and they saved millions and millions
of dollars when they moved to our storage architecture. It's also very suitable
for Notion, right? You have a lot of workspaces and some subset of them are
active at once, and now you can realize those economics. They have more than 10
billion vectors, they do really large write peaks, and have millions of
namespaces for all of their data. One of the things that I really liked was that
once Notion moved to us, they removed all the per-user AI charges, and they've
been really, really good partners to us. The last use case I'll show here is
Linear, another one of our customers. They were dealing with Elasticsearch and
pgvector before, and they wanted something that was really hands-free where they
can just pump in all the data. They didn't have to worry about it; they didn't
have to think of machine types, and they didn't need anyone to operate it and be
on call for it. And they got the cost reduction, but that just made them more
excited to connect even more data into the LLMs. They really think about us as
this foundational search layer. In the same way that 10 years ago we expected all of our SaaS to ship a mobile app, today we expect every SaaS to have semantic search and some kind of generative research mode. We're going to expect a baseline of AI features as these SaaS platforms evolve, and Linear really thought of us as the foundational search engine for all of that. With that, exactly on the
30-minute mark, I will hand it over to you, Jason.
Jason Liu [30:00]:
Oh, look at that. I mean, you're free to continue with any other slides. I saw some pretty interesting slides in the appendix, but let's jump into some questions.
Yeah, I definitely see the case for these cost optimizations. Maybe not in 2025, but I remember in 2020 I was talking to some companies that were using what I'd call legacy vector search systems, and they could not go GA until they moved to turbopuffer. Those kinds of stories are what really caught my attention in those early days.
Just looking at some of these questions, I feel like a lot of these are actually
just comparisons against other search systems. People are really curious about
tools like Elasticsearch and Qdrant. I think, you know, generally when I think
about things like turbopuffer, I think about the fact that, you know, Notion,
Linear, Cursor, they have these really well-defined partitions: there's a workspace, there's a repository. But what would it look like for turbopuffer to power something like Twitter or e-commerce, where there are just no natural partitions, or maybe there are? I'd love to hear your thoughts on that.
Simon Eskildsen [26:23]:
Yeah, so the partitioning was really a go-to-market move, right? Every startup's
weapon is focus. And our focus is that the only thing that scales even now is
sharding. I learned that at Shopify; we shard on shop for everything. And so we
felt that a really good way into the market was let's give people unlimited
sharding and give them a really good experience with lots of small shards, right? That was Cursor and Notion. We used to not be very good at large shards: if you were at 10 million plus, we would actively tell people, "No, go use something that's been around for a while. Go use Pinecone, go use Qdrant, something like that. That is not our ICP." Now we're very good at
that. We have customers that run in production that have namespaces that are in
the hundreds of millions. We're working with customers on basically building
Google, right? Like they want to search 100 billion documents all at once. And
that's what we're working on. We have customers that are searching 1 billion
plus documents, and we are getting very, very good at this. The trick here is
that when you search, and this is the same in Elasticsearch, any system that's
scaled does some kind of sharding, right? To use multiple machines. The trick is
you want the shards to be as large as possible. A small shard is like when I ran
Elasticsearch at Shopify, we were targeting our shard sizes at around 30 to 50
gigabytes. And so if you have a data set that is in the hundreds of terabytes,
which billions of products easily is, you have to search M shards at, say, log N each, or whatever the complexity of your search is. That's not actually the complexity of an inverted index, but let's say it's log N. Well, if M is very high, you're spending a lot more computational resources than if you make M smaller and N higher. So you want the largest shard sizes you can. And
that's what we're trying to work on right now to get very, very large individual
shard sizes. So in order to do 100 billion, you have to run many indexes, but we
will continue to make those as large as we possibly can. For a very high-throughput, in-memory implementation of this, Qdrant's implementation seems like something that some of their customers I've spoken to have had really good success with. I think where it starts to get
difficult to maintain an HNSW at that scale is if you have a lot of churn in the
data and you're doing very high write throughput. So we'll get there; it's not a
fundamental compromise in our architecture. We will get good at it. I think
we're going to get exceptional at it, and I think the results of the POCs right
now are quite good. To compare us to Elasticsearch, it's really that traditional
storage architecture, right? I've been on call for Elasticsearch. It's probably
the worst database I've operated in my life, and part of this company is my
vendetta against it. So I do have some bias, but it's probably gotten a lot
better since I worked with it almost 10 years ago. But it has a more traditional storage architecture, right, with two or three copies on SSDs. You run those at about 50% utilization, and disks are smaller, so it can be difficult to realize the
economics you need, right? At the end of the day, an infrastructure decision is
a set of trade-offs and then economics you can earn a return on. So if you have
a per-user cost of your Elasticsearch cluster, whatever cluster you're using, of
$10 per user and you're charging them $20, well, that's not a good return,
right? And so what we want for you is to increase the ambition of your product,
index more data, and have a better return so you can get to the gross margin
that you need to get to.
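Simon's M-times-log-N point earlier in this answer can be made concrete with a back-of-envelope. As he notes, log N is a simplification, not the true cost of an inverted index; the shard sizes below are illustrative.

```python
import math

# Total query work modeled as M * log2(N): M shards searched,
# each holding N documents (a simplification, per the talk).
def query_cost(total_docs, shard_docs):
    m = math.ceil(total_docs / shard_docs)  # number of shards to fan out to
    return m * math.log2(shard_docs)        # work summed across shards

total = 100_000_000_000  # 100 billion documents
small = query_cost(total, 10_000_000)      # many 10M-doc shards
large = query_cost(total, 1_000_000_000)   # few 1B-doc shards

print(small / large)  # bigger shards cut total work by roughly 78x
```

The fan-out term M dominates, which is why pushing individual shard sizes up matters more than making each shard's lookup marginally faster.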
Jason Liu [29:57]:
Yeah, great answer. I definitely like that a lot. I guess one question that came
from Adam actually is around cache controls. The question is, what kind of cache
controls do you hand over to the user, and how much tuning typically goes into
achieving these costs and performance goals?
Simon Eskildsen [30:12]:
So by default, we don't really want you to think about it. So for example, if
you use S3 or GCS, you can turn on automatic storage tiering. What that means is
basically that if you don't access the data for a while, it moves to a lower, cheaper storage class, and when you access it again, it moves back up. That's how we
think about turbopuffer as well. So we want to default to really good cache
behavior. We don't want you to have to think about it. We don't want you to have
to configure a namespace to be cold or warm. If you don't have that many
controls, because generally our users don't need it, there's very specific edge
cases like this case I'm talking about about searching 100 billion web
documents. Well, if you're doing that, then, yeah, we're going to work with you
a little bit on the caching before we know the heuristics to get this right at
that scale because it's a difficult bin packing and cache problem. But for most
other workloads, the default behavior is phenomenal. The main control that you
have is that you can send a hint_cache_warm request to turbopuffer. If it's
not in cache, we will charge you one query and start hydrating the cache, and if
it's already in cache, it's free. That's the main trigger that you have today,
and it works great.
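The hint Simon describes can be sketched as a query payload; the field names and request shape below are illustrative guesses, not turbopuffer's actual API (check their docs for the real format):

```python
import json

def build_warmup_query(namespace: str, query_vector: list[float]) -> dict:
    """Build a hypothetical cache-warm query payload.

    Per the transcript: if the namespace is already cached, the call is
    free; otherwise it costs one query and starts hydrating the cache.
    All field names here are illustrative, not the real API.
    """
    return {
        "namespace": namespace,
        "vector": query_vector,
        "top_k": 1,               # we only care about warming, not results
        "hint_cache_warm": True,  # the cache hint from the transcript
    }

payload = build_warmup_query("prod-docs", [0.0] * 4)
print(json.dumps(payload)[:40])
```

A client might fire this once on user login so the first real search hits a warm cache.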
Jason Liu [31:20]:
Oh, I wasn't aware of this one, but one of the questions is, will you remove the
minimum spend requirements at some point for these smaller use cases?
Simon Eskildsen [31:31]:
Yeah, it's a good question. So I think that if you have a very small use case
and you already have a Postgres database, you could probably get away with
pgvector. As we mature our product to the point where we're less of a single
knife and more of a whole drawer of tools to help you do your search, then I
think it makes sense to do this.
The main reason we have the minimum spend requirement is because we really want
to give people a really, really good experience. We take it as an extremely
serious commitment that you're trusting us with your uptime, so we're scaling a
support team and an on-call pager and all of that to be extremely responsive if
there's any issue, even if it's not caused by us. That's why we have that minimum spend
requirement. It is not an infrastructure minimum. It's nothing like that. It's
really just to guarantee a good experience, and we expect to lower that minimum
over time. And it's not that we won't have a free tier ever. It's just not the
right choice for us right now to have a free tier and support it with the
high-quality support that we've come to pride ourselves on.
Jason Liu [32:37]:
That makes a lot of sense. I guess this is the question I'm also curious about
because I noticed this in Cursor sometimes, which is that, you know, maybe for
Notion documents and for Linear tickets, there's not much editing of these data
objects, but you can imagine in Cursor, if you change a file very quickly and
make another query, how do you think about refreshing the index and how should
we think about designing such a system that maybe is like a little more write
optimized?
Simon Eskildsen [33:02]:
Yeah, so I mean, first off, turbopuffer is very write optimized. It's partially
very write optimized because Cursor does a lot of writes and Notion does a lot
of writes. I think there's a couple of angles to talk about this question from.
So the first one is a realization that many of our customers have made, and the
first time it was explained to me, it was like, "Oh, yeah, this makes a lot of
sense." If you're doing full-text search, you almost want to re-index on every
keystroke, because the exact string changes on every keystroke. That matters for
full-text, but the semantic meaning of a chunk doesn't change on every single
keystroke in the same way, especially if you're using hybrid search. And
creating the embeddings is often much more expensive than storing them in
turbopuffer. So we see customers find some compromise that makes sense to them.
They do the vibe check of how many characters, how many bytes, what edit
distance before we have to re-embed, such that the semantics are still captured
well enough, and then debounce by some time, right? After some time, you always
make the change. That's very common. I think it's a very interesting
observation. The second
thing is just that there is an economical piece to it, right? Do you want to do
this like all the time or do you want to do this every minute or how is your
pipeline set up? It costs money to keep these ANN indexes up to date, and so you
have to make some reasonable compromise here that you can earn a return on. I
think we're very sympathetic to users that want to do a lot of writes, and so
the system is heavily optimized for it. But it's a choice made by the users, and
it's not something that we really provide any particular constraints on. Lots of
our customers do it close to every keystroke, and it's generally driven by the
economics of creating the embeddings.
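The compromise described here, re-embedding only when the text has drifted enough or a debounce window has expired, can be sketched as a small gate; the thresholds and the character-diff proxy are invented for illustration:

```python
def char_delta(a: str, b: str) -> int:
    """Cheap edit proxy: positionwise mismatches plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def should_reembed(old_text: str, new_text: str,
                   last_embed_ts: float, now: float,
                   min_char_delta: int = 50, max_age_s: float = 60.0) -> bool:
    """Decide whether a chunk's embedding is stale.

    Re-embed if (a) the text drifted past the threshold (a rough proxy for
    semantic change), or (b) any pending change has waited longer than the
    debounce window. Full-text indexing, by contrast, updates on every
    keystroke.
    """
    if old_text == new_text:
        return False
    if char_delta(old_text, new_text) >= min_char_delta:
        return True
    return (now - last_embed_ts) >= max_age_s
```

An editor integration would call this on save or keystroke, re-embedding only when it returns True.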
Jason Liu [34:51]:
Next, on the topic of embeddings, this is a question that a couple of folks
upvoted, which is, is turbopuffer optimized for late interaction models like
ColBERT, where you're storing multi-vectors per chunk?
Simon Eskildsen [35:06]:
So it is and it isn't. It is in the sense that one of the largest challenges
with late interaction models is, and this is apparently my pet peeve today, that
they're very hard to earn a return on, because it's an enormous amount of data.
It is the best, or some of the best, precision that you can possibly get. You
can do it with turbopuffer. I can share a gist, or I can send it to you, Jason,
as kind of a pastebin of how to do it with turbopuffer. Fundamentally, all it is
is that for every token you send a top-K query of, let's say, a thousand. Say
you have 10 tokens in a query; you run all of those as a multi-query to
turbopuffer, get that back, and then you do a second layer of queries. You don't
have to fetch all the vectors, and then you emulate the late interaction
results. The question is whether you can live with the economics of it, right?
You can squeeze the embeddings small enough, you go to F16 and all of that, and
you can keep optimizing the economics. I've had that pastebin around for six
months, and I haven't seen anyone who's put it in prod yet. I think someone here
should put it in prod, and we would love to work with you, because we have the
multi-query pattern; it's totally possible to do in turbopuffer. Part of the
reason late interaction isn't widely adopted is that the math really just comes
down to economics: how much are you willing to pay for that extra 10 to 20% of
precision that you get with late interaction over, say, late chunking and the
other techniques you could otherwise use? But if you have a small amount of data
and you just care about precision, I think you should try it, and I think
turbopuffer would probably by far be the most economical way of doing it. But
no, we don't have a single "here's late interaction" API where we handle all the
vectors for you. You have to string it together yourself, but it's very simple
to string together.
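A toy version of the pattern Simon describes, with the per-token ANN query stubbed out as an in-memory function rather than a real turbopuffer multi-query, might look like this; the second stage is the standard MaxSim late-interaction score:

```python
# Toy corpus: each doc is a list of token vectors (2-d for brevity).
DOCS = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.7, 0.7]],
}

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def search(qvec, top_k):
    """Stub for a per-token ANN query: ranks docs by their best token
    vector against qvec (a real system would hit the database here)."""
    scored = [(max(dot(qvec, t) for t in toks), d) for d, toks in DOCS.items()]
    return [d for _, d in sorted(scored, reverse=True)[:top_k]]

def late_interaction(query_tokens, top_k=10):
    """Stage 1: one top-K query per query-token vector; union the candidates.
    Stage 2: MaxSim. For each candidate, sum over query tokens of the max
    dot product against any of the doc's token vectors."""
    candidates = set()
    for q in query_tokens:
        candidates.update(search(q, top_k))
    scores = {
        d: sum(max(dot(q, t) for t in DOCS[d]) for q in query_tokens)
        for d in candidates
    }
    return sorted(scores, key=scores.get, reverse=True)

print(late_interaction([[1.0, 0.0], [0.0, 1.0]]))  # → ['a', 'b']
```

In production the stage-1 queries would go out as one multi-query batch, and stage 2 would fetch only the candidates' token vectors, which is the economics trade-off Simon is pointing at.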
Jason Liu [36:57]:
On the side of economics with embedding models, one of the questions here is
around quantization. Does turbopuffer offer to do its own quantization
internally? How do you think about optimizations on that front? Or is it the
developer who's responsible for quantizing these vectors before writing them up
to turbopuffer?
Simon Eskildsen [37:12]:
Yeah, so we do quantization and clustering and all kinds of things to optimize
the performance of your index. If you send us smaller vectors, then we'll index
the smaller vectors. At this point, we don't support integer or binary vectors,
but we will. The smaller you're able to send them, the better the economics and
the performance are going to be. Because if you send us an F32 1024-dimensional
vector, I have to store it at full precision, since you want to be able to
export it later at full precision, so I have to charge you that way, right? If
you can pass us something quantized, then we don't have to store it at full
precision. So it becomes a bit of an API problem. The other thing is just that
getting all the SIMD instructions right for all the different architectures and
all the different quantizations takes a bit of time. We find that most of our
customers are very happy with the economics of F16, but we'll continue to
quantize smaller and smaller as our users request it. But fundamentally, the way
I look at it is that quantization cuts across a couple of dimensions. I've
talked to big embedding providers about this, and it's roughly right: what
matters is how many bits of information you're passing. One lever for that is
how many bits you have per dimension, and the other lever is how many dimensions
you have. So whether you have an F16 vector with 100 dimensions or a binary
vector with many more dimensions but one bit per dimension, it's the same number
of bits to compress and project a learned representation into. So you should
fundamentally be able to get away with a much smaller F16 vector if the
embedding model is optimized enough, and that's what we've seen from some of the
sophisticated users. They just train on F16 because it's extremely well
optimized in the CUDA pipelines and performs really well, and then they do
truncation and normalization with Matryoshka learning to get a really good
representation. But yeah, that's probably a way longer answer than they wanted.
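The bits argument can be made concrete with back-of-envelope arithmetic (ignoring storage overhead):

```python
def vector_bits(dims: int, bits_per_dim: int) -> int:
    """Total information budget of an embedding: dims × bits per dimension."""
    return dims * bits_per_dim

# An F16 vector with 100 dims carries the same number of bits as a
# binary vector with 1,600 dims:
assert vector_bits(100, 16) == vector_bits(1600, 1) == 1600

# A full-precision F32 1024-dim vector vs the same vector quantized to F16:
print(vector_bits(1024, 32) // 8, "bytes vs", vector_bits(1024, 16) // 8, "bytes")
# → 4096 bytes vs 2048 bytes
```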
Jason Liu [39:23]:
That was great. I feel like this is the first time, again, I've heard this depth
of technical talk. This next question is pretty good. So, you know, it might be
a whole session on its own. There are a lot of questions around turbopuffer's
facets and aggregates and how they're actually useful in practice, right? Maybe
in the legal context. Given that you are from Shopify, could you talk a little
bit more about like how maybe facets could be used in the traditional context
and then we can maybe extend into the agentic context?
Simon Eskildsen [39:55]:
Yeah, I think you should talk a bit about the agentic context because I know
that's something you're very excited about, and we share the e-commerce path, so
I'll talk about it there. I don't know if this is still the case, but actually I
was part of the team that shipped facets at Shopify, and we actually did it on
MySQL because not all the collections in Shopify were powered by search; they
were powered by MySQL collections, and then the search facets were different. So
there it's really just a big select count with a group by and unions; that's really what it is. So
how are facets used? Well, I mean, in an e-commerce context, right, it's used as
like, okay, I'm doing this search and I want the different colors, so like blue,
orange, whatever, right? So it's really a select count from table group by and
then the facet, whatever you want to facet on. So you might want to facet on
color, you might want to facet on size, and then there are some really weird
ones like faceting on price because with price you want to roll this up into
some distribution that makes sense, and the cardinality gets very high. But
fundamentally, it's really about like when you go to an e-commerce site and you
do a search on that left-hand side, you want to have some relevant filters
popped up. I'm sure all of you have seen really bad faceting, where they have
every single price listed on the left side and every size, and there's red,
burgundy, maroon, and it's just like, no, I don't care; I want this normalized,
right? The form of faceting that's in turbopuffer today is very simple, and it
doesn't support every single one of these cases. Faceting on a vector search is
kind of funky, because a vector query has a result for every single entry in the
data set, so you need to pass a threshold and then facet on that. So you might
want to facet on the full-text search part of the hybrid search instead; it all
gets a little bit funky, but we're expanding that with feedback. So if any of
you are turbopuffer customers, or are about to be, you should drop that kind of
feedback in the community channel. But Jason, I think you should speak about it in the agentic
context.
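The select-count-group-by that facets boil down to can be sketched in a few lines; the price bucket edges here are arbitrary:

```python
from collections import Counter

PRODUCTS = [
    {"color": "blue", "size": "M", "price": 19.99},
    {"color": "blue", "size": "L", "price": 45.00},
    {"color": "orange", "size": "M", "price": 120.00},
]

def facet(rows, field):
    """Equivalent of: select field, count(*) from rows group by field."""
    return Counter(r[field] for r in rows)

def price_facet(rows, edges=(25, 50, 100)):
    """Roll raw prices up into buckets so the facet's cardinality stays low."""
    def bucket(p):
        for e in edges:
            if p < e:
                return f"< ${e}"
        return f">= ${edges[-1]}"
    return Counter(bucket(r["price"]) for r in rows)

print(facet(PRODUCTS, "color"))  # → Counter({'blue': 2, 'orange': 1})
print(price_facet(PRODUCTS))
```

The price roll-up is the "weird one" Simon mentions: without bucketing, every distinct price would become its own facet value.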
Jason Liu [41:59]:
Yeah, totally. So I think, you know, one of the things we think about in terms
of like what is the difference between like traditional vector search versus
this like context engineering is the fact that in the context, we're trying to
give it clues, give the agent clues to make better use of tools in the future,
right? So you can imagine a simple file search might just be a text input, and
then I return 10 chunks. But what if I can also go back and say, you know what,
of the chunks that would have been returned, actually 45% come from this file or
this directory, right? If we include that in the context and then tell the
agent, you know what, if you make subsequent queries, we can return you full
pages from a document, whole documents, whole pages from a PDF, then you might
be able to load better data in the future, right? A simple example in coding
agents could be something like using grep or using find. When you run the find
command, really what you're doing is counting how many files have how many
occurrences of a certain keyword. And then the next tool you call is a read-file
tool, because the find query has surfaced some file that is really relevant.
And this is because we've designed a portfolio of tools with a range of
parameters that you can use to filter against and do all these other
interactions. And so by having more information in the search result in the form
of facets, we can then use those facets to make better searches in the future.
This is the same thing in e-commerce, right? Maybe I am searching for shoes. And
so my first query is just shoes. And then maybe I get two facets. Maybe I get a
brand facet and then I get a, you know, stars rating facet. And I realize, oh,
what I actually want is five-star shoes from Nike under the $50 price
point. I can click those three facets and make a new search query that's just,
you know, maybe, you know, find me running shoes filtered by Nike, filtered for
stars, filtered by price. And you could imagine a sophisticated agent could have
done this in one step. But the truth is, we didn't have that context, right? I
did not know there were so many Nike shoes. I did not know I had the ability to
filter on stars. I did not know what the reasonable price filter is. But by
providing that in context, you can whittle down your search, right? And this is
really going down to this idea that even with worse search tools, because these
agents are so persistent, as long as you give them more context and the context
can be used to call the tools in a better way, you're going to get better
performance. And I think that's one of the things that many vector databases
lack, which is the ability to provide additional context around data that wasn't
returned. And that's a pretty long-winded answer, but that's kind of why I'm
excited about some things like this, because Elasticsearch provides them, but
not many vector databases. I'm curious if you have any thoughts on that.
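One sketch of what this would look like, returning a facet summary over the full match set alongside the top-K chunks so the agent can refine its next query, using an invented result format:

```python
from collections import Counter

def facet_summary(all_matches, top_k_ids, fields, limit=3):
    """Summarize facet counts over the *full* match set, not just the
    returned top-K, so the agent sees what it could filter on next."""
    lines = []
    for field in fields:
        counts = Counter(m[field] for m in all_matches).most_common(limit)
        rendered = ", ".join(f"{v} ({n})" for v, n in counts)
        lines.append(f"{field}: {rendered}")
    returned = [m for m in all_matches if m["id"] in top_k_ids]
    return {"chunks": returned, "facet_hint": "; ".join(lines)}

matches = [
    {"id": 1, "brand": "Nike", "stars": 5},
    {"id": 2, "brand": "Nike", "stars": 4},
    {"id": 3, "brand": "Asics", "stars": 5},
]
out = facet_summary(matches, top_k_ids={1}, fields=["brand", "stars"])
print(out["facet_hint"])  # → brand: Nike (2), Asics (1); stars: 5 (2), 4 (1)
```

The `facet_hint` string would go into the agent's context next to the chunks, so the follow-up query can add brand or rating filters it otherwise wouldn't know exist.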
Simon Eskildsen [44:56]:
No, I mean, I think it makes sense. I think I had this in one of the slides, but
every successful database ends up implementing every query. One of the
unfortunate things about the way the query language has evolved in
Elasticsearch is that it's very, very difficult to use, and it's very difficult
to understand what's important. And that's one of the strengths that we have,
right? In working closely with our customers, and with feedback from people like
you, Jason, we can have some opinions about what we should prioritize for the
context. And I like this framing of: I just want a summary of what's important
about everything that wasn't in the top K, something that we, as the database
people, know can be computed within a reasonable latency budget, right? So I
think that's great and really interesting feedback.
Simon Eskildsen [46:34]:
I think in legal, I mean, I don't know anything about legal other than the
contracts that I have to read, right? But like I could imagine that I, if I'm
searching in my contracts, I want to know like by customer, right? And then be
able to filter to that because I have like a custom DPA, a custom MSA. I might
have like some amendments to the agreement, blah, blah, blah, blah, right? I
might want to search for like what lawyer worked with me on this, right? Who was
involved? Who is the signatory? Right? Like all of these different things that
are just essentially these follow-up filters that both a human and also an agent
would find really interesting to discover in the data that, I mean, at the end
of the day, right, like what something like turbopuffer really just needs to do
is to allow humans and agents to converse with the data. So we need to provide
all of the queries that can do that. And so faceting is a part of that because
there might be hundreds or thousands of different colors, right, in a
collection. And we can surface some aggregate that makes sense because it
doesn't make sense to load a billion records into a context window, right? The
computations required on the attention to do that is just not possible to earn a
return on. So what does that database look like? Well, it's going to have some
kind of fuzzy search, but it's also going to have some other massaging of the
data that you can do to pivot the data around and massage it to explore
interesting things about it.
Jason Liu [47:06]:
I mean, to answer the question in the legal domain, you can imagine an example
where maybe I am searching for some clause in a legal context, and so I return
10 text chunks that have the clause. But maybe what I actually recover is the
fact that I have some facets on the file name, and I realize, you know, 90% of
the clauses came from three files. The agent might then say, you know what, I'm
not going to do my semantic search query next; I'm going to just load the whole
damn PDF, right? Because this PDF ID has been referenced 10 times by the
semantic search, just read the whole damn file. And it's a
separate read-file tool versus a search-chunks tool. And that could be another
simple example of that.
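That concentration check can be sketched directly; the 90% coverage threshold is just an example choice:

```python
from collections import Counter

def files_to_read_whole(chunk_sources, coverage=0.9, max_files=3):
    """If a few files account for most of the retrieved chunks, return
    them so the agent can switch from chunk search to a read-whole-file
    tool; otherwise return [] and keep doing semantic search."""
    counts = Counter(chunk_sources)
    top = counts.most_common(max_files)
    if sum(n for _, n in top) / len(chunk_sources) >= coverage:
        return [f for f, _ in top]
    return []

# Ten retrieved chunks, heavily concentrated in two contracts:
sources = ["msa.pdf"] * 5 + ["dpa.pdf"] * 4 + ["misc.pdf"]
print(files_to_read_whole(sources))  # → ['msa.pdf', 'dpa.pdf', 'misc.pdf']
```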
Jason Liu [47:55]:
That's great. That's great. All right, we have four minutes left. The question I
generally ask everyone at the end is: what is something you feel like folks are
not thinking about when they're reasoning about these kinds of systems? When
people are building these things, what's something they're not thinking about
these days that they should be?
Simon Eskildsen [48:13]:
I think they're not thinking about the latency of the embedding model. There's a
lot of embedding models out there. I'll pull a very concrete example because
it's one of the most widely used. The OpenAI text-embedding models are great,
but the problem is that they have very high P50 latency. The P50 latency you get
is 300 milliseconds. Well, it doesn't matter that the turbopuffer latency is 8
milliseconds when it takes 300 milliseconds to create a query vector. So we've
seen customers create hundreds of millions or billions of embeddings, and it's
not just OpenAI; lots of these providers have very high latency. They use them
fine for these agentic and Q&A workloads, but then they're like, "Okay, it's
time to do real-time search," and
it's too slow. And they've already created all these embeddings. And again,
people right now are not limited by the turbopuffer cost; they're limited by the
embedding costs, and re-embedding everything to switch models sucks. So don't
fall into that trap. Make sure that if that's a use case
that you care about, that you measure that embedding latency. We've seen great
latency from models like Cohere and Gemini. I think Voyage has also really
gotten very good at latency. Together and Jina have also had pretty good
latency, but there's some of the models that really just don't have very good
P50 latency. So just make sure you test it yourself rather than taking blanket
advice, because the latency from your region to whatever region they're in also
matters; they run wherever they get the cheapest GPUs, so that they can earn a
return again. That's what it all comes down to. So do that testing for yourself.
It's very easy
to do. You can ask Cursor or whatever coding agent of choice you're using, and
it can do this test for you in like five minutes, but you got to do it so you
don't paint yourself into a corner.
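The five-minute test is just a percentile over wall-clock timings of your embedding call; `embed` below is a stand-in stub to replace with whichever provider you're measuring:

```python
import time

def embed(text: str) -> list[float]:
    """Stand-in for a real embedding API call — swap in your provider's SDK."""
    time.sleep(0.001)  # pretend network + inference time
    return [0.0] * 8

def percentile(samples, p):
    """Approximate percentile via a rounded rank over the sorted samples."""
    s = sorted(samples)
    idx = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[idx]

latencies = []
for _ in range(20):
    t0 = time.perf_counter()
    embed("how do I cancel my order?")
    latencies.append((time.perf_counter() - t0) * 1000)  # milliseconds

print(f"p50={percentile(latencies, 50):.1f}ms  p95={percentile(latencies, 95):.1f}ms")
```

Run it from the region your queries will actually originate in, since cross-region hops are part of the number you care about.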
Jason Liu [50:07]:
That's actually a great answer. I should just go make a script that people can
run to benchmark all of these providers.
Simon Eskildsen [50:14]:
You should. You should do it in your course because legally you can't do it in
public — the ToS prevent you from doing that at scale.
Jason Liu [50:15]:
I'll put that on the to-do list.
Jason Liu [50:21]:
Perfect. Yeah, we'll do that as the next action item. With that said, you know,
if there's any other message you want to have for the audience, feel free to
share that now. And then I'm going to follow up with you afterwards, get some
links, and we'll distribute them in the reader notes.
Simon Eskildsen [50:36]:
Yeah, I mean, no, not really. Come try turbopuffer if any of this resonated.
Jason Liu [50:42]:
Everyone else, thank you for tuning in. We'll send the recording and the notes
afterwards. And again, Simon, great talk. Learned a lot. And see you around.
Thank you all.