Kaivalya Abte [0:00]:
Hello everyone and welcome to the show. This is the Geek Narrator podcast. I am
your host Kaivalya Abte and today I have Simon with me, and he's coming back on
the show. Last time we talked about how you can think about building a database
on top of object storage, which was an interesting idea. We had a lot of
interesting discussions around it. And today we are going to talk about all
things vector, like vector search and vector databases. We'll try to understand
from Simon's experience how to think about it and how to build a mental model for
solving problems and building applications using vector search and vector databases. So
welcome, Simon. I'm really excited to have you back on the show. For our viewers
who are new to the channel, who are new to listening to you and new to
turbopuffer, let's start with a bit of introduction about yourself, about
turbopuffer, and what you have been doing these days.
Simon Eskildsen [0:56]:
Yeah, absolutely. My name is Simon. Thank you so much for having me on the show
again; it's great to be back. Last time we talked about building a generic database on
top of object storage. I think today we'll get into turbopuffer and building a
search engine generically on top of object storage. You asked for a little bit
of the origin story of, I guess, the database and myself. I spent just shy of a
decade building infrastructure at Shopify. I saw the rise there from
hundreds of requests per second up to a million, and as part of that, I worked
on almost every single part of the infrastructure, including search. When I left
Shopify in 2021, I spent a little bit of time hopping between my friends'
companies, consulting on infrastructure, investing a little bit of equity in
their companies, and helping them out with something that was top of mind. The
biggest thing I did for a lot of these companies was tune Postgres autovacuum,
which is very difficult to tune and a bit foreign coming from operating MySQL
for almost 10 years at Shopify. But one of the
companies, this company called Readwise, builds an engine to help you read
better. They have articles and highlights from your articles and books that you
read from Kindle and things like that, and they wanted to do recommendations or
at least try it out. So we ran a two-week sprint on doing recommendations, and I
was really interested in vectors because back at Shopify, I had worked a bit on
search, mostly on the infrastructure side, but you end up working a bit of
everything. Bringing back relevant search results is kind of a nebulous activity,
right? If someone searches for "red dress" on an e-commerce store, you might need
to show them a burgundy skirt because you don't have a red dress, but maybe
they're interested in that. There's a lot of that. As my coworker Doug always
said, we need to make strings into things, and that's very difficult when all
you have are strings. Of course,
vectors solve that problem. They take a red dress and a burgundy skirt and place
them adjacent in the coordinate system, and we wanted to do that with articles.
With LLMs, you basically chop the head off the LLM and out come embeddings,
because there's an internal representation that these are often trained on. You
do some fine-tuning specifically for this, but that's a simple way to think
about it. What we did was take these articles and build vector representations
of them, thinking, okay, these articles that are going to be adjacent in vector
space with a small distance between them in the coordinate system are going to
be good recommendations. We actually got pretty good recommendations out of it.
They were okay, not phenomenal. This was early in the LLM and embedding days in
2022, but it was pretty good, and it's better than I could have done with
strings. We ran the back of the envelope math on putting this on one of the
vector search engines at the time, and it was a phenomenal cost, around $30,000
a month for a company that was spending maybe $4,000 to $5,000 a month on their
Postgres database. It just seemed out of reach, and it was a sad moment because
we had this idea that actually worked, and we wanted to get it into the product,
but cost was holding us back. I distinctly remember Dan, one of the co-founders,
saying, "Well, I'm sure the costs are going to drop just as they are for LLMs."
Four or five months later, I couldn't stop thinking about this problem, and no
one had come out with a vector search solution that was more economical. It was
literally holding product back at this company, and I thought, "Well, there
must be some latent demand here." There's Jevons paradox, which often comes up
alongside AI and falling prices. It basically says that if you make gas cheaper
by, let's say, 20%, then people often drive more than 20% more. You can intuit
it like this: say you live in Berlin and gas gets cheaper. There may be some
trade route to another city that a local company selling, I don't know, pastries
couldn't justify before; now they're going to start offering to also sell them
in Hamburg because the fuel costs work out. The extra activity you get is
nonlinear, and it's the same argument for making energy cheaper. I thought the
same might happen for search engines the way that it has for analytical data in
the 2010s. I didn't see all this at
the time, so there's a little bit of retroactive analysis here. But certainly, I
saw the example firsthand with Readwise. Thinking back on my experience at
Shopify working on search, it was very clear to me that the Elastic clusters at
Shopify were among the hardest databases to operate, right? For all the right
reasons, and it was the best thing on the market, but it was a very difficult
thing to upgrade and scale, and it was holding back a lot of search for a
variety of reasons, cost being one, indexing performance being another. I think
back to the days when I was debugging those clusters until midnight, trying to
get them to perform better or dealing with some incident. Between my experience
at Readwise and thinking back on Shopify, it had become clear that maybe it was
time for a new search engine, and vectors seem to be pressing this even more because when
it comes to text, which is still the most ubiquitous form of data, the vector
representation of the text is often larger than the canonical text
representation. You can imagine an article that's maybe 1,000 tokens long,
right? So you represent that as 1,000 tokens. Maybe the tokens are five bytes on
average, so that's about 5 to 6 kB of data for the raw text, which compresses
incredibly well. It's going to be hundreds of bytes to store this. Now, a vector
representation of that text, you're going to have multiple vectors, maybe one
per paragraph or something along those lines. Let's say that this 1,000-token
article is now four vectors, and each of those vectors might be 1,536
dimensions, so that's around 6 kB each. Now you have four times 6 kB, right? So
24 kB. You can't really compress these vectors, because every floating point
carries so much semantic information, whereas with text there's a lot of
repeatability. Now you've got this
vector representation that's almost 20 to 30 times larger than the compressed
representation of the text itself. That's why we saw at Readwise that the vector
cost was so high because all of it had to be stored in memory for a variety of
reasons, and it was larger than all the original data and the compressed
representation inside of Postgres where the article was also stored. That's
always been a problem with storage. It's always been expensive, because even
inverted indexes have some storage amplification, right? Storage amplification
is how many bits you store per bit of input data, and for an inverted index,
that might be somewhere between 1.5 and 2.5. Sometimes it's smaller than even the original
text, depending, but it's generally in that range. But now we're talking about a
storage amplification of 10 to 30x, right? If you build a database on top of
object storage, as we talked about in the other episode, the fundamental
trade-offs are high write latency to write directly to object storage in the
hundreds of milliseconds and high tail latency in the hundreds of milliseconds
when something is cold on object storage. These seem like phenomenal trade-offs
for search, especially if you then make it 100 times cheaper, which S3 is,
right? S3 is around two cents per gigabyte, and memory is around $2 per
gigabyte. That seems like a good trade-off. On subsequent queries, things are
loaded into memory, and things are going to be fast. So that is kind of the
origin story of turbopuffer.
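To put rough numbers on the storage-amplification and cost argument above, here is a small back-of-the-envelope sketch in Python. The prices and sizes are the approximate figures mentioned in the conversation (five bytes per token, 1,536-dimensional float32 vectors, about $2/GB-month for memory versus $0.02/GB-month for S3), not exact quotes.

```python
# Back-of-the-envelope storage amplification for one 1,000-token article.
# All constants are illustrative, taken loosely from the conversation.

tokens = 1_000
bytes_per_token = 5                            # rough average for text
raw_text_bytes = tokens * bytes_per_token      # ~5 kB of raw text
compressed_text_bytes = raw_text_bytes * 0.2   # text compresses very well

chunks = 4                                     # e.g. one embedding per paragraph
dims = 1_536
vector_bytes = chunks * dims * 4               # float32 -> ~24 kB, barely compressible

amplification = vector_bytes / compressed_text_bytes
print(f"text: {raw_text_bytes/1e3:.1f} kB raw, ~{compressed_text_bytes/1e3:.1f} kB compressed")
print(f"vectors: {vector_bytes/1e3:.1f} kB (~{amplification:.0f}x the compressed text)")

# Keeping 1 TB of vectors hot in RAM vs. on object storage, per month.
ram_per_gb, s3_per_gb = 2.00, 0.02
print(f"1 TB in RAM: ${1024 * ram_per_gb:,.0f}/mo   1 TB on S3: ${1024 * s3_per_gb:,.0f}/mo")
```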
Kaivalya Abte [9:15]:
Very interesting. It looks like the economic cost was at the center of your
decision-making, right? Because that was the main problem, as you discussed
with the storage amplification. Apart from cost, of course, like we discussed
in the last episode, building something on top of S3, you already have a lot of
guarantees around durability and the scalability of the system. You don't have
to build that into your system. Of course, you have to think about it and handle
some of the edge cases, as you said, about latency, write latency, and tail
latency, but then that also simplifies your architecture. You also mentioned
some of the terms like you used the term "embeddings," you used terms like
"vectors" and "dimensions" and "tokens" and stuff like that. I don't want to
assume that all our viewers know all these basics, so let's quickly get these
basics out of the way so we can talk about the real stuff, like how it
works. What exactly is vector search? What exactly do we mean by embeddings? We
can probably take an example and talk about it and explain in really simple
terms what they mean.
Simon Eskildsen [10:34]:
Yeah. First off, I use vectors and embeddings somewhat interchangeably. I think
there might be a more precise definition for each, but for all intents and
purposes, we can talk about them as the same. What are they? Well, LLMs are
very, very good at knowing what data is relevant to other data. That's why
they're next token predictors, right? You have "hello," and it thinks that
there's a very high probability that the next word is going to be "world."
That's fundamentally what an LLM is able to do, and one of the internal
representations of that is a bunch of floating point numbers. The simplest way
to think about them is as a two-dimensional point, where you train the model so
that points that are close to each other in space are also semantically close
to each other. What I mean by that is that you take
a banana and you take an apple, and you train the model to say, "Well, they're
pretty close." They're not quite the same; like a pear is maybe closer to an
apple than a banana, but they're pretty close. I mean, they're at least closer
than a computer is to any of those words. Now you end up with a coordinate
system, right? It's easiest to intuit this in two dimensions. You end up with a
coordinate system, and you will end up with essentially these clusters, right?
There's going to be a fruit cluster, there's going to be an electronics cluster,
there's going to be a whatever animal cluster, whatever, right? The information
that is close in the coordinate system is also semantically close in the real
world, right? So that's really what a vector embedding is. It's a point in a
coordinate system where stuff that is related to other stuff is close. You could
think about the Euclidean distance between the two points, so that's just what
you did in geometry in sixth grade or whatever, right? LLMs are very, very good
at taking information and turning them into a bunch of floating point numbers in
this vector space. The reason we call it vector space and not a point is because
it's not in two dimensions; it's in maybe some multiple of 64, 128, 384,
whatever, thousands of dimensions. You can think about each of these dimensions
maybe representing a real thing. We don't really know what they do, right? It's
latent space. We don't completely know without spending a lot of effort on it,
but you could imagine that maybe one of these dimensions represents something
about size, right? An elephant has a very high floating point number as the
first one, and an ant has a very low number. Maybe dimension number two
represents something about color, right? Maybe dimension number three represents
something about where in the globe this thing often presents itself, right? You
can imagine then you can represent almost every object like that, and that's
what LLMs are very, very good at. LLMs are good at taking information and
turning them into a coordinate system, into a point in a coordinate system. What
we were not so good at was storing this coordinate system, right? And making it
cheap to store and cheap to query. It's been cheap to take information and turn
it into a point in a coordinate system, but not so cheap to store the coordinate
system and ask it questions in high-dimensional space of what's adjacent to
this, what's close to this point, what's the average of something that's close
to these points, or whatever questions you could think to ask such a coordinate
system. That's what the ANN, or approximate nearest neighbor, index does: it
represents that coordinate system in a way that's efficient to navigate and query.
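To make the coordinate-system picture concrete, here is a tiny sketch with made-up three-dimensional "embeddings"; real embeddings have hundreds or thousands of dimensions, and the numbers below are invented purely for illustration.

```python
import math

# Made-up 3-dimensional "embeddings": semantically related things sit close
# together, so fruit ends up near fruit and far from electronics.
embeddings = {
    "apple":    [0.9, 0.8, 0.1],
    "banana":   [0.8, 0.9, 0.2],
    "pear":     [0.9, 0.7, 0.1],
    "computer": [0.1, 0.2, 0.9],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = embeddings["apple"]
for word, vec in embeddings.items():
    print(f"apple -> {word}: {euclidean(query, vec):.2f}")
# pear and banana come out much closer to apple than computer does
```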
Kaivalya Abte [14:33]:
Interesting. I think that's a very clear example and clear picture of what all
these things mean. Of course, there are so many dimensions to think about when
we search for things; it's not like a direct keyword match. That sounds good.
Let's take an example for our entire discussion so we can base our discussion
around that example and try to model that and solve that in a way using vector
search and vector databases, and how we can do that, let's say, using
turbopuffer. The example is again related to the podcast. So far, let's say I
have more than 100 episodes, and I want to kind of search across all the
episodes that I have. Give me all the clips. This is like the high-level feature
that I'm talking about, right? Then we can go a bit lower. Give me all the clips
of discussions where we were talking about a certain database or a certain
index, like binary search or something like that. Given a specific topic, I want
a clip, like timestamps of all the clips that I have, right? Specific to my
podcast, for example. If I have to build this, how do I start thinking about
this problem space? How do I know that vector search is something suitable here,
vector databases are something suitable here? Let's start talking about that.
Simon Eskildsen [15:59]:
Yeah, absolutely. I love rooting all of this in an example, and I'll try to
bring it back into the vocabulary we were using. So let's design a search engine
that searches through all this data. If I search for "object storage," right,
this episode will come up, and the relevant points are going to come up when we
talked about that in either episode. The way you would do this, right, is that
you first have to make a decision about what your chunk is, right? A chunk is
the information you feed into the model to get a point in the coordinate
system, the vector or the embedding. We have to decide what that's going to
be. We could maybe take the entire episode and turn it into a point in the
coordinate system. That might work, right? That's basically, can we fit the
entire episode inside of a context window? But intuitively, it would be very
difficult to represent an entire episode about a vector database in a, say,
1,024-dimensional space, right? That's a very, very good compression algorithm
if we're able to do that. So we might want to go to a finer granularity. The
whole-episode approach would be very cheap, though, right? Because you have 100
episodes, so 100 vectors. We're not talking about much data, 6 kB times 100,
right? So we're talking less than a megabyte of data. So that would be great.
It's very simple. We could just
do that in memory in JavaScript in the browser. But the search is not going to
be great. If we break it down and trade a little bit more complexity here for a
better search experience, we might say that we're going to chunk based on, let's
say, every 30 seconds, right? So we say, okay, every 30 seconds, we're just
going to do a hard cutoff and then create an embedding for that. That's probably
going to be a decent representation. If you intuit it, how far around the
coordinate system can you really get in 30 seconds? So then we would take an
episode like this, right? And let's say it's 60 minutes, then you're going to
have 120 embeddings, still a very manageable amount of data, right? So 120
times 100 episodes times 1,024 dimensions. Let's say we're storing it all as
four-byte floats. That's like 50 megabytes of data. So, again, it's manageable in memory or
something like that. It's a very manageable amount of data, and we could build
probably a pretty good search engine on that. You could ship that with
turbopuffer, but frankly, with that few vectors, you could also do it in memory.
You can search 50 megabytes very fast, right? You can scan memory at this point
at 20 to 30 gigabytes per second on a fairly small machine, so you could do that, and
we would get pretty good search results. At this point, we could start to,
again, ratchet up complexity for a better experience. One way that we could
improve the performance then is that right now, the default way is that you take
the query. The query might be "turbopuffer," and we search for "turbopuffer,"
and then we turn that into an embedding, and then we do the query to the
coordinate system. In this case, we might just exhaustively search through all
the vectors of all the episodes and then return the top results. For 50
megabytes, we might be able to do that in tens of milliseconds on a small
machine, so it's not too bad. If we had 10x that amount of data, like 500
megabytes, we're starting to run into latency on this, especially as we get into
gigabytes, but it would work. It would just be slower. What you can then do for
that is you can build an approximate nearest neighbor index, called an ANN
index. Right now we're doing a kNN search: we're comparing the query vector to
all the other vectors and finding the top 10 results to show on a search page.
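As a minimal sketch of that exhaustive kNN step, assume every 30-second clip has already been embedded into a matrix; embed() below is a stand-in for whatever embedding API you call, and the metadata layout is made up for illustration.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("call your embedding provider here")

def search(query: str, chunk_vectors: np.ndarray, chunks: list, k: int = 10):
    """Exhaustive kNN: compare the query vector against every clip vector.

    chunk_vectors is an (N, D) float32 matrix, one row per 30-second clip;
    chunks holds (episode_id, start_seconds) metadata for each row.
    """
    q = embed(query)
    # Cosine similarity is just the dot product of L2-normalized vectors.
    q = q / np.linalg.norm(q)
    m = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
    scores = m @ q                      # one score per clip
    top = np.argsort(-scores)[:k]       # indices of the k best clips
    return [(chunks[i], float(scores[i])) for i in top]
```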
An approximate nearest neighbor index is one where we trade off a little bit of
precision, or recall as it's called, by building an approximate, kind of
geometric, index of this representation. One way you can do it, the simplest way to do it,
is that you take all the data and you try to cluster it. You take every point
and you try to see, okay, in this conversation, right, we're talking a bit about
databases, we're talking a bit about vectors, we're talking a bit about Shopify,
we're talking about this and that, and some clusters of conversation might
emerge, right? These 30-second clips can now be set into these clusters. To make
the search faster, we could imagine that you search for "turbopuffer," and it
looks for a cluster maybe on databases based on this conversation, and it
searches only that cluster. It's not searching the Shopify cluster; it's not
searching the whatever. I mean, we're going to talk about databases the whole
time, so there might be some sub-clusters in that cluster, but intuitively, you
can imagine there's a banana cluster, right? There's an animal cluster, and
there's a computer cluster, right? You only search the relevant one. That
clustering you do with an algorithm called k-means, right? That groups content
together. Doing it optimally is kind of an NP-hard problem, so you have to do
some approximation. Now, this is not
always going to work, right? You're going to search for something that straddles
the boundary between some of these representations. If you're searching in
music, there may be artists that are not pop, not rock, but they're somewhere in
between. There's some approximation with how many clusters you visit. You can
imagine that you might have a vector index with a million vectors, and you might
have, let's say, a thousand of these centroids, right? These clusters that are
in the content, and you have to choose how many of the clusters closest to your
query vector you're going to look at. That's one of your
trade-offs in these approximate nearest neighbor indexes. There are also other
ways to represent it, but this is probably the simplest one. We could do that to
speed up the search, right, for your podcast search website. The other thing we
can do is we can try to improve the search results. What we might find is
that the 30-second boundary is not great. It doesn't represent itself well.
Sometimes we're changing topics a lot. You might try to find a way to cut
embeddings at some natural conversational boundary. This is a podcast, so
you're kind of asking me questions, and then I might give embarrassingly long
answers, but you can't embed all of that as one chunk because I'll change what
I'm talking about all the time. So you might maybe use
an LLM, maybe a very fast, cheap LLM to figure out what the boundaries are and
then do embeddings on that. Maybe you find another way; maybe you're always
doing it based on sentence. There could be all kinds of ways of deciding what
the embedding should be to improve the relevance. Another thing you could do is
that you could try to incorporate other ranking signals to improve the
relevance, right? You could do things like, "Ah, I'm going to weight podcasts
that are more recent higher." You might do that. You might also combine it with
a text index because maybe the embedding model doesn't have any chance to know
what "turbopuffer" is, and so it thinks that "turbopuffer" is maybe a fast fish,
but really it should be categorized in the database cluster, and you might get
bad results as a result of that. Full-text search is going to be great because
it will know to just match directly on "turbopuffer." Maybe we give that a
higher relevance. This is kind of the curse of search, this iterative loop of,
"Okay, am I improving search results?" You have some test cases; you look at
whether they're better compared to what you would do if you knew all the data.
This is kind of like that constant push and pull between, "Am I
going to put more data in the database, and how do I index it faster?" That's
the part that something like turbopuffer takes care of, and then you take care
of knowing your data, your users, how you want your search to be structured,
improving all these different signals via turbopuffer's APIs, or another vector
database or search engine in general you're using. We don't think of turbopuffer
as just a vector database. It's certainly a prominent feature, and we're very
good at storing vectors, but we can also do full-text search and other things to
help you get good search results.
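A rough sketch of the centroid-based (IVF-style) indexing described above: cluster all the vectors with k-means, compare the query only against the centroids, then exhaustively search the few closest clusters. This uses scikit-learn's KMeans purely for illustration and is not any particular engine's implementation; nprobe is the knob that trades recall for speed.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_ivf(vectors: np.ndarray, n_clusters: int = 100):
    # Cluster all vectors; the cluster centers are the "centroids".
    km = KMeans(n_clusters=n_clusters).fit(vectors)
    # For each cluster, remember which vector ids fell into it.
    lists = [np.where(km.labels_ == c)[0] for c in range(n_clusters)]
    return km.cluster_centers_, lists

def ivf_search(q, centroids, lists, vectors, k=10, nprobe=3):
    # Step 1: find the nprobe clusters whose centroids are closest to the query.
    d_to_centroids = np.linalg.norm(centroids - q, axis=1)
    probe = np.argsort(d_to_centroids)[:nprobe]
    # Step 2: exhaustively search only the vectors inside those clusters.
    candidates = np.concatenate([lists[c] for c in probe])
    d = np.linalg.norm(vectors[candidates] - q, axis=1)
    return candidates[np.argsort(d)[:k]]
```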
Kaivalya Abte [24:21]:
Okay, I think that's a good mental model, like how we should start thinking
about the problem. When we really face that kind of problem, for example, for
this kind of data, I don't think that we really need a database. As you said, we
can also do it in memory, right? But then you also have to keep tuning the
search results and how you are planning to do the chunking, and probably you
want to create more clusters. Then you can think about making it faster and so
on. I think this example, if we start thinking about a pretty high scale, for
example, let's say now I want to build it as a service for all the podcasters,
right? I want to onboard them, so this massively increases the scale. There are
many other things coming into the picture because the data size increases, the
type of podcasts and the discussions would increase, not only technology but
maybe something else as well. Depending on how you want to provide the search,
how you want to chunk the episodes, how you want to add more clusters, I think
that all adds up to the complexity and the solution we should be aiming for,
right? Maybe for this podcast, it's fine to just do it in memory, but maybe if
we do it for all the podcasts, we want to have many more podcasters on this
platform, then it becomes a big problem, right? Something where we also need to
focus on cost-effectiveness and stuff like that.
Simon Eskildsen [26:04]:
One thing, just before we move on, is that you might decide to use a search
engine even if you have a small amount of data just because it provides the APIs
to do some of these other things like full-text search. You don't have to do all
these things in memory. That would be one reason. If you only have a couple
hundred vectors, turbopuffer also just exhaustively searches; we don't index it
at all. Sorry, go on.
Kaivalya Abte [26:31]:
No, I think that makes sense. The decision has to be based not only on the cost
aspect, not only on what you want to provide, but maybe if some database
provides you a simple solution, just using that and getting a mix of vector
search and full-text search, I think that could also help. But as you said,
let's increase the scale of the problem, right? There are so many podcasts, many
types of topics, different lengths, different guests. Probably there are also
specific guests that people want to search on, right? For example, if I have a
group discussion, right, not just one-on-one, but maybe there are more people in
the discussion, so people would also want to search when someone like Simon
talked about this thing, right? So that adds to the types of clusters we want
to create and the indexes that would work on them. So let's increase the scale of the
problem, increase the variety of the data, and then start thinking about the
problems that I might get when the latency spikes or the size of the data grows
and the cost increases. So how should I think about that?
Simon Eskildsen [27:52]:
Cool. We can walk up the order of magnitudes here. I think that's an interesting
exercise. So yeah, let's say at the 50 megabytes, which would probably be what
the vector representation here would be. First of all, that's much more than the
text of all of this, which you could compress well. If you go up in complexity
and you start to search more data, you might have millions or... In the millions
of vectors, most databases can do a really good job of vector search, right? If
you have a Postgres instance, Postgres comes with a number of different
extensions that you can find that are open source to do vector indexing.
Honestly, that's great. If you just have a couple million vectors and the
features in Postgres are going to get you there from a relevance point of view,
that's probably simpler than introducing something else that you have to
replicate data into. Where we start to see our customers seeking us out is when
you start to get into the tens of millions or even into the hundreds of
millions. That's when building these indexes and storing all the data and
querying it, especially if you're doing it with a WHERE clause, starts to get
tricky. Let's talk about the WHERE clause for a second because you asked about
that. What if I want to search for a particular author, or maybe I want to
search for a particular date range, or I want to search for a language? Now your
vector space is not just what's the closest thing; it's what's the closest thing
that also matches this filter. It's not just about what's similar, but what's
similar that also has this hard constraint. That starts to get very tricky. We
have a blog post on our website where we talk about what we do here. At a high
level, you can think about it as let's say that we've tuned this to know that
when I search for something, I look at the three closest clusters, and I search
for everything that's in those clusters. By default, let's say that those
clusters are of size 1,000. I'm just making up numbers here, right? That means
that every time we search something, we're looking at around 3,000 vectors to
try to get a good, accurate result, right? For kNN, we search everything,
so we know that that's 100% accurate. For an ANN index, often they're 95%
accurate or more, which means that between the top 10 results you get from an
exact search and the top 10 results you get from the approximate index, there's
basically a 95% overlap on average, right, across everything. Some are going to
be 80; some are going to
be 100. That's what recall means in that context, and that's a very important
thing for any vector database. When you have a WHERE clause, maintaining a high
recall of 95% starts to get more tricky because imagine now I search for those
three clusters, and they have 1,000 vectors each, but only 500 of those vectors
match the filter. Now I'm only looking at 1,500 vectors every time I do a search
query. Is that really enough to get 95% recall? Generally not. You probably want
to still look at 3,000 as you were in the other case. Now I have to look at more
clusters, and I have to keep looking at clusters until I've looked at maybe
3,000 vectors. These are the things that make ANN indexing very tricky because
you have to tune this for the dataset, and you have to tune this for basically
how rare the filter is. This one cut 50% off, but if you cut 90%
off, maybe you have to look at more clusters, or maybe you can come up with
another clever way of navigating the clusters. Maybe you have another index
where you execute the WHERE clause, and then you just look up the clusters and
know how much matches. There are lots of different ways that the query planner
here has to take into account to provide high recall with the filtering. That's
the kind of stuff that we talk about when we talk about vector databases and why
a new category emerged. I think that category will, in turn, evolve from vector
databases into search engines. But this is the kind of thing that sometimes
Postgres will have trouble with. Postgres will make some naive decision and not
have a query planner that's completely integrated into the high recall use case.
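One simple strategy a query planner might use to keep recall up under a WHERE clause, roughly as described above, is to keep probing the next-closest cluster until enough filtered candidates have been examined. This is an illustrative sketch, not any particular engine's planner; matches_filter stands in for the WHERE clause.

```python
import numpy as np

def filtered_ivf_search(q, centroids, lists, vectors, matches_filter,
                        k=10, min_candidates=3000):
    # Visit clusters from closest to farthest until we've examined enough
    # vectors that pass the filter, so a selective filter doesn't quietly
    # destroy recall.
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))
    candidates = []
    for c in order:
        candidates.extend(i for i in lists[c] if matches_filter(i))
        if len(candidates) >= min_candidates:
            break
    candidates = np.array(candidates, dtype=int)
    d = np.linalg.norm(vectors[candidates] - q, axis=1)
    return candidates[np.argsort(d)[:k]]
```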
We see some cases of this. You can get away with this by just looking at more
data or pre-filtering and things like that. But in general, this starts to be
challenging. The other thing that gets challenging when you get into the tens of
millions or hundreds of millions of vectors is that the vectors are large and do
not compress well. If you store it in Postgres, right, you are kind of by
definition in a relational database. You are going to store it on the writer,
the primary, and then you will often store it on two read replicas off of the
writer. This means that you're storing the data
three times. The other thing you're going to do is that no one in their right
mind is going to run a database at a 90% disk utilization, right? That's just
too scary. You might run it at, let's say, 50%. You might be comfortable with
that. So now we have to buy the actual disk to store the data on, and those
disks, like let's say an EBS volume or a PD in GCP, cost you somewhere between
10 and 20 cents per gigabyte, depending on what's underlying this disk. NVMe is
a little cheaper, maybe 8 cents, but it has its own challenges because as soon
as you restart the node, it's gone. You really want to know what you're doing
before going to those. With the network disks, say 10 cents per gigabyte; let's
round down to that. But because it's 50% disk utilization, you're actually paying 20 cents,
and you have three of them. This means that every gigabyte of data you store,
you have to pay for it three times, and then you have to multiply by two because
of this utilization. So you're paying 60 cents per gigabyte of data stored. On
object storage, you pay two cents per gigabyte. Of course, with vectors too,
you're not just storing the vector once. You often store the vector multiple
times. You have to do all these things where sometimes you store it not just in
the cluster closest, but in multiple clusters and things like that. Often these
indexes get even larger than just storing the vector multiple times, which is
why something like Postgres and the economics of it start to at some point not
make sense. At some point, essentially, you're weighing two curves: the cost
and operational complexity of keeping it in the main database, and the cost and
operational complexity of having a different database. At some point, those
graphs cross, and you may decide that you want to break it out. Since
the dawn of time, something like Postgres has had full-text search features.
MySQL has had Sphinx and others. But often at a certain scale, this becomes
one of the things that gets uneconomical. It becomes difficult to manage inside
the main database. Keep it in there for as long as you can because it's simple
to manage. But there comes a point where babysitting that becomes more difficult
than operating a separate search engine outside of the Postgres universe. You
might also be ready for some of the features to enhance your search that are not
in there. So then it becomes a question of that and also of indexing, right?
Building an approximate nearest neighbor index is an expensive activity. If you
have a 32-core or 64-core Postgres cluster and you're building a
100-million-vector index, that could consume all of your CPUs while you're doing that. I
love the DBAs at Shopify, but I don't think any of them would let me run ANN on
all of those hundreds of thousands of shards, right, to build ANN indexes at
that scale. That's kind of the trade-off, right? But if people can get away
with the features inside of something like Postgres, I would really, really
encourage that. Those are expanding over time.
But you can't really outrun the economics, and they catch up to you quick
because the vectors are so large.
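Writing out the replication and utilization math from a few moments ago, with the same illustrative prices rather than exact quotes:

```python
# Per-gigabyte monthly cost of vectors in a replicated Postgres setup
# vs. object storage, using the rough figures from the conversation.
network_disk_per_gb = 0.10   # EBS/PD-class storage, roughly
copies = 3                   # primary plus two read replicas
disk_utilization = 0.5       # nobody runs disks 90% full, so pay ~2x

postgres_per_gb = network_disk_per_gb * copies / disk_utilization
s3_per_gb = 0.02
print(f"Replicated Postgres: ${postgres_per_gb:.2f}/GB-month")   # $0.60
print(f"Object storage:      ${s3_per_gb:.2f}/GB-month")         # $0.02, ~30x cheaper
```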
Kaivalya Abte [35:31]:
Yeah, that's interesting. I think it also adds up because, let's say you use
Postgres and you release a feature that, oh, now you can do this, but then you
always want more features, right? Your users want more features, and they want
to search more. They want to also filter more, and then it's always going to
grow. So even if it works now, maybe it won't work, let's say, in six months or
one year down the line, depending on the data size, of course. You mentioned a
few interesting things there, right? One, like building these indexes is a
really expensive operation. So I want to understand how that is done. But before
we do that, let's say now I decide to use a vector database, something like
turbopuffer, for example. I use LLMs to create all the embeddings, and I give
them to turbopuffer. What exactly happens then? When is the index created?
The other question I wanted to ask was about recall, but probably after this
question.
Simon Eskildsen [36:45]:
Yeah, I mean, simplified, when you write a vector to turbopuffer, we put a file
into object storage in our write-ahead log, right? Every new entry makes it in
there.
Simon Eskildsen [36:56]:
Then we return an OK once we've committed to object storage. That's our fsync.
We call a turbopuffer database a namespace; you can think about it as just a
prefix on S3. When you have written, say, 10 vectors and you query it, we're
just going to load them all into memory, right, or stream them in from S3, then
into the disk cache and into memory. There's no point doing anything else.
Write 100 vectors, we'll still do an exhaustive search. At 1,000, we'll
probably still do an exhaustive search, but once you get into many
thousands or tens of thousands, then we'll start building the index. The
indexing nodes are going to get a notification to take that namespace, that
prefix on S3. Let's say it's "podcast/," and it will go get "podcast/" and then
it will start building the ANN index. We're not using a graph-based algorithm on
turbopuffer for this; it's always going to be something centroid- or
cluster-based, like I described before. There's a lot more tricks to it, but
that's the indexing we'll do. Then we'll start building that index. When you've
done a query on turbopuffer, it will look inside of that prefix for, let's say,
an index file. If there's an index file, then it will go and look up the vector
in the index. If not, it will just exhaustively search. Either way, it applies
the newest WAL entries on top for that namespace. Yeah, that's how the writes work.
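Very roughly, that write and query flow can be sketched like this. The store, ann_search, and exhaustive_search objects are hypothetical stand-ins, and this is not turbopuffer's actual code or API; it only illustrates the shape of the flow (WAL objects under a namespace prefix, exhaustive search until an index exists, then the index plus recent WAL on top).

```python
import json
import time
import uuid

def write(store, namespace: str, docs: list) -> None:
    # Append a WAL entry under the namespace prefix on object storage and only
    # acknowledge once the object is durably committed (the "fsync").
    key = f"{namespace}/wal/{int(time.time() * 1000)}-{uuid.uuid4()}.json"
    store.put(key, json.dumps(docs))

def query(store, namespace: str, q, ann_search, exhaustive_search, k: int = 10):
    # Everything written since the last index build lives in the WAL prefix.
    wal_docs = [d for key in store.list(f"{namespace}/wal/")
                  for d in json.loads(store.get(key))]
    index = store.get_optional(f"{namespace}/index")
    if index is None:
        # Small namespace: no index yet, just scan the WAL exhaustively.
        return exhaustive_search(wal_docs, q, k)
    # Otherwise search the index, then apply the newest WAL entries on top.
    hits = ann_search(index, q, k) + exhaustive_search(wal_docs, q, k)
    return sorted(hits)[:k]   # assuming hits are (distance, id) pairs
```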
Kaivalya Abte [38:30]:
Interesting. So it automatically decides that, oh, this is the right time to
build indexes because the data is growing, and the user doesn't have to run
anything like "CREATE INDEX." It automatically decides. On the query side, how does it know that
for this particular query using an index would be better? Because that may also
depend on the query itself, right?
Simon Eskildsen [38:57]:
Yeah, so that's the query planner's job, right? By default, with ANN searches,
it will always use the index if it's there.
Kaivalya Abte [39:04]:
Mm-hmm.
Simon Eskildsen [39:05]:
Exactly. It uses the index, and I mean, there's multiple indexes, right? If
you're filtering, then we have to build non-ANN indexes for that and kind of
interleave them. The query planner has to look at the statistics of the data
distribution and make decisions about how it's going to do that to provide high
recall for the query. But if there's an index, we'll always use the index.
Kaivalya Abte [39:24]:
Okay, interesting. What are the other types of indexes that are typically
useful? How does it decide based on the nature of the data or the queries it has
seen, or how does it decide what kind of index would be suitable? Or maybe
it builds multiple indexes, and the query planner decides which index to use.
Simon Eskildsen [39:48]:
There are multiple types of ANN. Again, these are the approximate nearest
neighbor indexes that are in use today in production. I'd say in broad strokes, there are
two that are most used. There are the centroid-based ones. That's what
turbopuffer uses because it works best for object storage, and it works really
well for lots of writes for reasons we can get into if you want to get into
that. Then there are the graph-based indexes. The graph-based indexes are
phenomenally fast at queries, right? Where a centroid-based index might take a
millisecond, a graph-based index might take 100 microseconds, right? But
everything has to be generally in memory for a graph-based index, so that's the
downside. If you're doing 100,000 queries per second and you want to do it in a
single node, it's very hard to beat a graph index. But very few people are doing
that.
Kaivalya Abte [40:36]:
Mm-hmm.
Simon Eskildsen [40:36]:
That's why we think the centroid-based approach is very, very good for a lot of
workloads. The graph-based indexes, if we just get into it a little bit, you can
imagine a graph-based index as you try to build a graph over time where vectors
that are close in vector space are also connected in the graph.
Kaivalya Abte [40:55]:
Mm-hmm.
Simon Eskildsen [40:55]:
Then essentially, you can imagine that you start in the middle of the graph, and
then you look at, okay, this node might have, say, 64 other nodes or vectors
that it's connected to. I'm going to go and look at the one that I'm closest to
and perform a greedy search. Then I have a priority queue of what am I now
closest to. I keep navigating the graph for some hyperparameter-tuned number of
steps and size of the priority queue, and then at some point, I spit out the
result. This is really fast in memory, but the trade-off is that when you
navigate a graph, you have to do a lot of round trips, right? It's like I go
here, then I have to go get this data, then I go to the next point in the graph,
then the next point in the graph, and every time I do that, I have to do a round
trip. A round trip to memory is fast, right? It's in low hundreds of
nanoseconds. If it's in the CPU cache, it's in single-digit nanoseconds. This is
very fast to do 10 or 20 of them. But if you're going to a disk, a random read
to an SSD could easily be in single-digit milliseconds, even for an NVMe SSD.
For object storage, of course, we're talking hundreds of milliseconds for every
round trip. The way that something like turbopuffer is designed, it just doesn't
fundamentally work well with a graph. You can do things with a graph to shrink
what's called the diameter of the graph to try to minimize the number of jumps
you have to do, but you start to lose a lot of the benefits of it being a graph
to begin with.
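For intuition, the greedy best-first walk at the heart of these graph-based indexes might look like the sketch below. The termination rule here is a simple step budget rather than the tuned stopping heuristics real implementations use, and the graph itself is assumed to already exist.

```python
import heapq
import numpy as np

def greedy_graph_search(q, vectors, neighbors, entry, k=10, max_steps=200):
    """Best-first walk over a proximity graph.

    neighbors[i] lists the node ids that vector i links to; entry is where the
    walk starts. Every hop is a pointer chase, which is why this really wants
    to live in memory rather than on object storage.
    """
    dist = lambda i: float(np.linalg.norm(vectors[i] - q))
    visited = {entry}
    frontier = [(dist(entry), entry)]     # min-heap of candidates to expand
    best = [(-dist(entry), entry)]        # max-heap (negated) of the best k so far
    for _ in range(max_steps):
        if not frontier:
            break
        _, node = heapq.heappop(frontier)
        for nxt in neighbors[node]:
            if nxt in visited:
                continue
            visited.add(nxt)
            dn = dist(nxt)
            heapq.heappush(frontier, (dn, nxt))
            heapq.heappush(best, (-dn, nxt))
            if len(best) > k:
                heapq.heappop(best)       # drop the current worst candidate
    return [node for _, node in sorted(best, reverse=True)]
```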
Kaivalya Abte [42:17]:
Mm-hmm.
Simon Eskildsen [42:18]:
The centroid-based approach, right? We're back to the coordinate system. We're
back to finding information that's semantically grouped and clustered, right? We
have the rock cluster, the pop cluster, and the electronics cluster, right? Data
that naturally groups together will also naturally work when we then search for,
I don't know, electronic dance music, and then it goes to the electronic cluster
and searches just that. Centroid-based indexes make use of that, and they work
really well for object storage because we just have to do fundamentally two
round trips, right? We have to store all the centroids somewhere. The centroids
are the average of all the vectors in the cluster, and we just search all of
those. We find the top k closest clusters, and then we go download all those
clusters. We're downloading a lot more data than we do in the graph, but that's
what disks and S3 are very, very good at, is downloading a lot of data. This is
also called IO depth, right? Basically, this fan-out needs to be very, very
high. But if you have everything in memory, fundamentally, downloading all that
data could be a bit slower than doing a bunch of round trips, right? Which is the fundamental
trade-off between these two approaches. But it works very, very well for object
storage and for disks to do the centroid-based approach.
Kaivalya Abte [43:28]:
Interesting. Do you also do this in parallel? For example, I imagine that when
you say, "Okay, search this cluster and the other cluster," these are again S3
files, right? If you have to search a given search space, you already know it
up front, so you could make parallel calls to S3 and make the search results
come back faster, right?
Simon Eskildsen [43:57]:
Of course.
Kaivalya Abte [43:57]:
Yeah.
Simon Eskildsen [43:57]:
Yeah, we do exactly that. I mean, we go way beyond that, right? When we search
for something, we do all these S3 requests. But even within a single S3 request,
if a block is larger than 4 megabytes, you often want to do multiple block
fetches and things like that. There are lots of optimizations you can do to try
to saturate the network. When the bits come back on the wire from S3, we just
transpose it, project it directly into memory. It's completely zero copy, right?
We can get it as fast as possible with as few memory copies as possible. The
only copy should be the copy from the network buffer right into user land, which
would be io_uring and so on. You could even get around that. That massive
concurrency is really good for S3, and it happens to be also very good for
modern disks. A lot of software today, including all the Unix utilities like tar
and so on, do not take advantage of the fact that you can do so many concurrent
reads to a disk—a ridiculous amount, right? Millions per second, and very little
software today takes advantage of that. You have to remember that these kinds of
SSDs only became generally available in AWS in mid-2018. Not a lot of databases
fully take advantage of something that was only GA in the clouds that late.
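As an illustration of that kind of fan-out, here is a sketch of issuing many ranged S3 reads concurrently with boto3 and a thread pool. The bucket and key layout are made up, and a real engine would go further with async I/O and zero-copy parsing of the responses.

```python
import concurrent.futures
import boto3

s3 = boto3.client("s3")

def fetch_range(bucket: str, key: str, start: int, end: int) -> bytes:
    # Ranged GET: ask S3 for only the bytes [start, end] of the object.
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

def fetch_clusters_parallel(bucket, keys_and_ranges, max_workers=64):
    # Issue all the cluster reads at once; S3 and modern NVMe drives both
    # reward very high concurrency far more than a sequential read loop does.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_range, bucket, key, s, e)
                   for key, s, e in keys_and_ranges]
        return [f.result() for f in futures]
```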
Kaivalya Abte [45:14]:
Interesting. That's very interesting. The other thing that you mentioned, and I
want to understand how it would work: basically, you may end up searching
everything, or you may end up searching parts of the clusters, depending on
whether you already have the index. You used the term recall, right? You also
want to meet a certain level, let's say 95%. How do you compare? How do you
measure that the recall is not going down, and where do you do that?
Simon Eskildsen [45:50]:
There are two ways of doing it. There is the academic way or the CI way where
you generate a bunch of datasets, and then you generate a bunch of queries. For
every query, you do a kNN search, so the exact search, right? The 100% accuracy
search that's slow, and you generate the correct result for, let's say, the
top 10. Then you run your approximate queries and then you compare the result
and you see, okay, on this query, it's 90%. It's 90% match. On this one, it's
80%, and this one is 100%. You average the results, and you've got the recall.
I mean, every vendor does that, right? Everyone runs that, probably as part of
their CI, or when they make changes to how their ANN indexing works.
They run this on a bunch of different benchmarks. So that's table stakes. If you
can go beyond that and generate your own datasets, you can do it with filtering,
right, which is the difficult thing. It's not that hard to do ANN search, but
it's harder to do ANN search with WHERE conditions and filters. Then you need
datasets around that. We make sure that recall is high for those, and you can
check it against an exhaustive search. So that's one way we do it. Of course, we are
also very humble to the challenge of production where we don't control the data
distributions. Academic datasets tend to have nice data distributions and not
always have the messiness of real-world data, where you might have a million
duplicates and things like that, and you still need high recall. What we do is
that for a percentage of queries, we measure recall in production. We
have a fleet of nodes, and all they do is just run exhaustive searches against
ANN queries to see what the actual recall is. Inside of our Datadog, we then
have grouped by all the different organizations. We can see what their actual
in-production recall is, so that when we tune the indexes and ship changes, we
can make sure that this is meeting a high-quality bar. Recall is very tricky to get
right on every single query, which is why we do this empirically. You can
imagine that there are some datasets, for example, where this is something we
come up against a lot. We run a POC with a customer; we share the recall
numbers, and maybe the recall is, let's say, 92%. They're like, "Oh, we want it
to be higher." We look at the queries, and it might be that they have a lot of
queries that don't have a good result. If you search for text, if you search for
"turbopuffer" in a dataset that doesn't mention "turbopuffer," you will just get
zero results, and you will be happy with that because there is nothing about
this thing. Even if there is something about vector databases or there is
something about fast fish, you're just not going to get it, right? And that's
fine, right? We've reached the limits of full-text search. But for vectors,
there is no such thing as a zero result, right? A banana is similar to a
computer in some way. An apple is maybe more related to a computer than a banana
is, but they're related in some way. When we analyze recall on some of our
customers' datasets, often low recall comes from queries that would
traditionally be a zero result. But because everything is related to everything else in vector
space, there's no such thing as a zero result. You can intuit that if you search
for something in a coordinate system, and nothing is close to it, and all the
clusters are really far away, it sort of ends up being a bit randomized what
results you get, especially in a centroid-based approach. In a graph approach,
often it fares better on these results that don't matter, but on a centroid
approach, not so much. We see a lot of limitations of recall in that way as a
general measure. It's a really useful signal, but there are cases where it
doesn't fare that well, because it doesn't encapsulate this kind of thing. We
would rarely see this in academic datasets, because a lot of them, not all of
them, have a query distribution that matches the actual data well. But in
real-life datasets, people will search for things that are not even close to
what's in the dataset.
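The CI-style recall check described earlier boils down to something like this sketch: compute the exact top-k as ground truth, run the approximate search, and average the overlap across queries. ann_search is a stand-in for whatever index is being evaluated.

```python
import numpy as np

def exact_top_k(q, vectors, k=10):
    # Ground truth: exhaustive kNN over all vectors.
    return set(np.argsort(np.linalg.norm(vectors - q, axis=1))[:k].tolist())

def recall_at_k(queries, vectors, ann_search, k=10):
    # recall@k = average overlap between exact and approximate top-k results.
    scores = []
    for q in queries:
        truth = exact_top_k(q, vectors, k)
        approx = set(ann_search(q, k))
        scores.append(len(truth & approx) / k)
    return sum(scores) / len(scores)   # e.g. 0.95 means 95% recall on average
```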
Kaivalya Abte [49:58]:
Interesting. Yeah, that's a very good point. Of course, it's very challenging to
observe this thing, and you probably cannot do it on each query. Talking about
going a bit further into the database, like you said, you have S3 files, and if
I remember correctly, last time we discussed the particular format that
you use, and I think you use the Rust library rkyv for the storage format. Is it
the same format that you have for indexes and the data, or is it something
different?
Simon Eskildsen [50:37]:
Yeah, I mean, we do intentionally have some differences in how we store the
index and the data. Part of that is just safety, right? If, God forbid, we
caused a problem in how the data is indexed, well, it's in a different code
path from where it's written into the WAL. So it's a safety thing; they are
similar, but they both use rkyv. rkyv is a Rust library for a zero-copy
serialization format.
Kaivalya Abte [51:18]:
Interesting. Cool. So we discussed how the data gets ingested, how files get
created in S3, and how and when indexes get created. What does it give me back?
Is it again like a set of vectors? Because whatever I am ingesting into the
database is basically only embeddings, right? How do I convert the embeddings
back into something meaningful?
Simon Eskildsen [51:53]:
Yeah, so generally what you do, right, is that if you have a chunk of text,
again, a snippet of a podcast episode in raw text, you send that to OpenAI or
Voyage or Jina or one of these companies that specialize in creating embeddings
and training models for it, and you get back the vector, right? A bunch of
floating point numbers. Typically, what we see our customers do is that they
will send the raw text, or the raw image, or whatever the raw entity is,
alongside the vector. Because we store it in object storage, you can then say,
"Well, I don't need filtering on this; I just need it to be stored alongside,"
and we give you a heavy discount because it's stored inside turbopuffer on
object storage like that. Generally, you don't try to reverse engineer the embedding; you just
store the canonical data. There's inevitably going to be some loss of
precision; trying to go back would be a little bit like taking a JPEG,
downscaling it, and then expecting to be able to upscale it again, right? You
are compressing it in some way, even though it is often larger than the original
text, right? For images, often it's smaller, but for something like text, it's
generally bigger, but it's lossy compression.
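Concretely, the shape of what gets written per chunk might look like the record below; the field names are illustrative, not a specific client API.

```python
# One document per 30-second clip: an id, the embedding, and the canonical
# text stored alongside it as an attribute, so query results can return the
# original snippet directly instead of trying to "decode" the vector.
doc = {
    "id": "episode-42-chunk-017",
    "vector": [0.0132, -0.2011, 0.0875],   # truncated; really 1,024+ floats
    "attributes": {
        "episode": 42,
        "start_seconds": 510,
        "text": "...the raw 30-second transcript snippet goes here...",
    },
}
```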
Kaivalya Abte [53:03]:
Okay, so what I get back is already the original data, right? Because you're
storing it right there in the vector database.
Simon Eskildsen [53:12]:
Yeah, alongside the vector, yeah.
Kaivalya Abte [53:14]:
Okay, so that's really powerful in the case of turbopuffer because the storage
is all in object storage anyway. Even if I put the entire data there, the size
of the data is not going to blow up the cost, even though I now have the data
and the embeddings and stuff like that. Of course, the raw data doesn't grow
with the number of vectors you're going to create and so on, so I think the raw
data is still smaller compared to the vectors and the indexes that you're storing, right?
Simon Eskildsen [53:50]:
Generally, yes.
Kaivalya Abte [53:51]:
Right. Okay.
Simon Eskildsen [53:52]:
There are cases for everything. You could do embeddings of massive pieces of text,
but in general, yes, exactly.
Kaivalya Abte [53:57]:
Interesting. That's cool. What are some of the interesting use cases that you
have seen, you know, people using turbopuffer and using it for a combination of
vector search and full-text search? What are some of the challenges that you
have seen?
Simon Eskildsen [54:19]:
Yeah, so we can talk through some of the case studies of some of our customers,
right? We've got Cursor as one of our customers, and Cursor, of course, if you
use Cursor, you'll have seen that sometimes it draws context in from the
codebase. You might search for the code that does the clustering, and then it
knows that, oh, it's the k-means code. It's in this file that we don't have open
right now. That's what Cursor can use to have the agents and the edits and
things like that, drawing context from the entire codebase. So that's one way
that they use it. They trained their own embedding model where, you know, they
take the code and they chunk it up and then encode it with their own embedding
model. Aman, one of the co-founders, shared some really impressive stats about
this embedding model that they're training. That's one use case. Another use
case is Notion. In Notion, if you're inside of Notion, you have the ability to
use AI to start documents, ask it questions about your data, "What's my
company's policy?" or "Can you draft a document about this for our weekly
meeting?" From that, it needs to draw in context from your Notion workspace, and
so they create embeddings of your entire Notion workspace, store them in
turbopuffer, and then use that to give the LLM the right context, right? So those would be
some of the case studies of some of our customers that people might know.
Kaivalya Abte [55:51]:
Yeah, absolutely. I think everyone might have used Cursor by now because it's so
popular, and of course, Notion as well. While I was thinking about Notion and
also some of the use cases, I don't know, for Canva, there's a lot of
duplication, right? People are creating the same type of images and the same
type of videos and stuff like that. Do databases like turbopuffer also take
care of optimizing that, like reducing the redundancy or the duplication, at
least for the same customers, right? Maybe for different customers, you may not
be able to do that, but for the same customers, is it also done?
Simon Eskildsen [56:38]:
I mean, there are two answers to that question. I think that if you know that you
have a lot of duplicates, then you can work on it on your data, right? You can
represent a document that says, "Okay, this is the image, and here are all the
places that it occurs in the dataset," and do your own deduplication. That would
be the customer doing it. If you have an image in, let's say, 100 Notion docs,
we can't really know that. I mean, all those vectors are going to have different
metadata, right? One is going to say, "Oh, I'm inside of this document," and
"I'm inside of this document." So we can't really do much there. You might be
able to do something in the ANN index where we compress it harder because we
know the vectors are the same and they have these attributes. Those are the
kinds of optimizations that a good ANN index will do when it encounters a lot of
duplicate data. But it can be hard to generically do it across all attributes,
right? But generally, turbopuffer is at a price point where we don't really want
people to have to worry about these things, right? We want people to just pump
in a lot of data, all the data they can possibly get their hands on, and make it
available to the LLM that they want to query, whether it's through an agent Q&A
or just general search for the end user. We don't want people to have to worry
too much about that kind of thing.
Kaivalya Abte [57:49]:
That's fair enough, right? Because ultimately, the cost of data is what you
have optimized for, so people may not have to worry about that. Since creating
indexes and, I don't know, even searching are pretty expensive operations, how
do you make sure that the
different tenants or customers have enough isolation so that they don't have
problems, especially when you have things on object storage?
Simon Eskildsen [58:29]:
Yeah, so object storage is by nature very good at multi-tenancy, right? There
are certain things that you do on object storage to make it better at
multi-tenancy. One of those things, for example, comes down to the fact that at
the end of the day, pretty much every database does some kind of sharding; it's
just a question of whether it's automatic or not. S3 and Google Cloud Storage
and Azure and all of them do the same thing where, you know, if you have three
keys, A, B, and C, and C is really, really hot while A and B are moderately
hot, then it might decide to put A and B on some nodes and C on some other
nodes, right? It continuously splits the key space. You can do various things
to randomize the key space as much as possible so that this kind of activity
gets better and everything gets spread uniformly over the shards. But in
general, S3 is very good at this kind of stuff; GCS is very good
at this kind of stuff. We just have to help it a little bit on our side, right?
Of course, when you query turbopuffer, you will hit a node that we think has
cache affinity to that data. Those nodes could get hot, and so there are various
things we can do there to split traffic across multiple nodes. We can make sure
that tenants aren't interfering with each other by limiting the amount of CPU
each tenant can use. There are lots of different things like that you have to do
when you run a multi-tenant system. But the flexibility of this is great,
because suddenly someone is ingesting a billion vectors over the course of a few
hours. We can scale that up, and we all share the compute when there are large
spikes, so it works really well. This is one of the things my co-founder Justine
and I spent an enormous amount of time on at Shopify, which is probably one of
the largest multi-tenancy systems in the world, right? It's making sure that
capacity is fairly shared, making sure that if one tenant drives millions of
requests per second, which might often be bots, that other tenants do not get
booted out or have some kind of degradation of service because of them.
Prioritizing traffic for a tenant, so for example, traffic that has a cart is
more valuable than traffic that doesn't, or whatever it might be. We can do
these kinds of things over time too to make multi-tenancy really good for
turbopuffer. At the end of the day, multi-tenancy really, like the main thing
you can do is just scale up as much as possible, and turbopuffer can scale up
very, very quickly. But at the end of the day, the clouds will only give you a
machine in around 120 seconds or so before it's booted up, and turbopuffer can
boot up very quickly. These are things that we'll get better at over time, but
the system performs very, very well.
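To illustrate the key-space randomization idea, here is a minimal sketch of how
a system might prefix object keys with a short hash so that lexicographically
adjacent tenants land on different storage partitions. The layout is purely
hypothetical and is not turbopuffer's actual object naming scheme:

    import hashlib

    def object_key(namespace: str, segment: str) -> str:
        # A short hash prefix spreads keys across the key space so that one hot
        # tenant is less likely to concentrate load on a single partition.
        prefix = hashlib.md5(namespace.encode("utf-8")).hexdigest()[:4]
        return f"{prefix}/{namespace}/{segment}"

    # Example: two tenants whose names sort next to each other end up far apart.
    print(object_key("tenant-a", "wal/000001"))
    print(object_key("tenant-b", "wal/000001"))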
Kaivalya Abte [1:01:01]:
Very interesting. Talking about observability, right? Any database needs to have
some metrics around how it is performing, query latencies, and index usage and
stuff like that. What are some of the interesting metrics that are probably
specific to vector searches and full-text searches?
Simon Eskildsen [1:01:22]:
Sorry, for monitoring?
Kaivalya Abte [1:01:23]:
For understanding the performance and if there is any problem going on and if
you need to scale up and stuff like that.
Simon Eskildsen [1:01:32]:
Sure, I mean, the biggest thing we look at is latency, right? So we look at
query latency. You look at indexing throughput, so how many tens or hundreds of
megabytes per second can you index? You look at how long the cache is living,
right? How many days does it take until something TTLs out of the cache? We look
at raw throughput. You look at CPU and things like that, of course. You look at
memory. Very standard things. I think maybe the most domain-specific things are
the indexing throughput and how well we're keeping up with how fast people are
ingesting. So for writes into the system, how fast are we indexing what people
are ingesting? We can often index up to tens of thousands of vectors per second,
so as long as you're ingesting at a lower rate than that, we're good. If you're
ingesting faster, we allow spikes and bursts, and maybe we'll throttle you;
worst case, we can't keep up. Those are the kinds of things we look at on the
write side. On the query side, I mean, we have lots of tracing in place to look
at query latencies and things like that. There are cold queries and there are
warm queries, right? So we optimize the cold queries coming straight out of
object storage, we optimize the warm latency once the WALs have made it to NVMe
SSDs, which takes quite a bit of work to make work really well, and then you
optimize the hot queries in memory, which are often quite fast.
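A minimal sketch of the write-side and query-side signals Simon lists, split by
query tier (cold from object storage, warm from NVMe, hot from memory). The
metric names and tiers below are illustrative, not turbopuffer's internal
telemetry:

    import statistics
    from collections import defaultdict

    class SearchMetrics:
        def __init__(self):
            self.latencies_ms = defaultdict(list)  # tier -> latency samples
            self.bytes_indexed = 0
            self.indexing_seconds = 0.0

        def record_query(self, tier, latency_ms):
            # tier: "cold" (object storage), "warm" (NVMe cache), "hot" (memory)
            self.latencies_ms[tier].append(latency_ms)

        def record_indexing(self, nbytes, seconds):
            self.bytes_indexed += nbytes
            self.indexing_seconds += seconds

        def report(self):
            for tier, samples in self.latencies_ms.items():
                p50 = statistics.median(samples)
                p99 = statistics.quantiles(samples, n=100)[98] if len(samples) > 1 else samples[0]
                print(f"{tier}: p50={p50:.1f}ms p99={p99:.1f}ms n={len(samples)}")
            if self.indexing_seconds > 0:
                mbps = self.bytes_indexed / self.indexing_seconds / 1e6
                print(f"indexing throughput: {mbps:.1f} MB/s")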
Kaivalya Abte [1:02:53]:
Very cool. I understand that more and more databases, because of cost reasons
and because of all the simplicity it brings, are moving to object storage, at
least in some of the use cases. Where do you see the trend of vector search and
vector databases going? Where do you see turbopuffer being used more and more?
Simon Eskildsen [1:03:25]:
Yeah, I mean, I think what we see our customers doing is that they want to
vectorize all the data they have, and they often have a lot more data than
they're currently vectorizing. That's number one. The second is that we see them
wanting to do full-text search, right? If you're searching in a dataset, the
embedding model might not know all the terms, right? I'm sure you use terms with
your friends or family or your partner that an embedding model has no chance of
knowing the meaning of.
Kaivalya Abte [1:03:54]:
Right.
Simon Eskildsen [1:03:54]:
Right? That's where full-text search is really useful. If you're searching for
an opaque SKU, which is always some weird combination of characters, full-text
search performs really well for that. We see vector databases, or at least ours,
moving from being primarily focused on vectors to being focused on search in
general, right? Allowing you to rank a file you just edited, or a document you
just edited, or a song you just listened to, as more relevant, even if you're
just doing vector search. We have to help you with that
kind of stuff. Typically, when you search, you sort of take millions or tens of
millions or billions of things and comb them down to a couple hundred. That's
what turbopuffer is good at. Once you have a couple of tens or hundreds of
things, then typically you'll have a model where you give it the entire document
or song—in your case, it would be just that snippet of text—and the query, and
you kind of run the full LLM over it. You can imagine passing that to ChatGPT
and then asking, "What is the most relevant of these ten?" and then you get a
result back. We see our role in that to do that first stage retrieval, right? To
find not the needle in the haystack, but find the needles in the haystack that
are relevant to you. It could be an LLM; it could be a trained re-ranker or
something along those lines. We see the trend here being that in the past,
people were used to writing a giant JSON document to describe exactly how
everything is ranked. Now, instead of writing big JSON queries or SQL queries,
they're writing a simple JSON query or a simple SQL query to go and locate the
tens or hundreds of needles in the haystack, and then in their Python script
they use some kind of re-ranker, sometimes called a cross-encoder, on the
documents they get back. This is very, very competitive with the very large JSON
queries that we've seen passed into, often, Lucene-based engines in the past.
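To make the two-stage pattern concrete, here is a minimal sketch: a cheap
first-stage retrieval narrows millions of documents to a couple hundred
candidates, and a cross-encoder re-ranks them. The first_stage_search callable
stands in for whatever engine does the recall (turbopuffer or otherwise), and
the model name is just a commonly used example:

    from sentence_transformers import CrossEncoder

    def search(query, first_stage_search, top_k=10):
        # Stage 1: recall a few hundred candidates out of millions of documents
        # (vector and/or full-text search).
        candidates = first_stage_search(query, limit=200)

        # Stage 2: score each (query, document) pair with a cross-encoder and
        # keep only the best few.
        reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        scores = reranker.predict([(query, doc["text"]) for doc in candidates])
        ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in ranked[:top_k]]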
Kaivalya Abte [1:05:55]:
Very interesting. Cool. I think, you know, in this one hour, we can only cover
so much. But I think we have covered the basics and also some interesting things
about how turbopuffer is able to provide these features and how it is
cost-effective and what the challenges involved are. But the space is large,
right? What would you recommend people read to understand more about vectors and
how to build applications around search as a unified experience, not only vector
but maybe also full-text and exact searching and stuff like that? Are there any
resources you would recommend?
Simon Eskildsen [1:06:41]:
Yeah, I mean, there's a lot of papers on all these indexing algorithms. I think
that honestly, ChatGPT is pretty phenomenal for this kind of stuff. I would ask
it, like, "How does it work? What's a graph-based index? What's a cluster-based
index? Hey, can you please give me a super simple Python or JavaScript
implementation?" Play around with it. That's probably the most effective way, if
you have the patience to sit down with it. Our blog has some content, and we
would like to have more; the docs also cover how these things work, how
turbopuffer works, the architecture of turbopuffer, observability, and I think
that will be a good guide as well. For full-text search, I mean, the Lucene
source code is where I've learned almost everything that I know about this.
The docs are also quite good. I would recommend that you read the docs on how
the data is stored. That's probably the most useful. If there's something you
don't understand, get ChatGPT to reason through it with you. Probably Cursor on
the Lucene codebase is going to be hard to beat for learning how full-text
search works. There are books and things like that, but I think that actually
going to the database with an inquisitive mind and some good questions and just
reading the source code and quizzing it is going to be the most effective way to
learn as much as possible about this. The other thing I'll say is to actually
write the search application. That would be the first place that I would start
because you're going to ask better questions as a result. So build the thing
that you're actually building, right? Even if someone else were to build it over
your dataset and hand you the keys, having that real application is going to be
the most useful way to come up with the best questions to ask once you want to
learn. So I would always start with
the actual use case.
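In the spirit of asking for "a super simple Python implementation" to play
around with, here is a toy cluster-based (IVF-style) index. It is purely
illustrative and ignores everything a production index has to care about
(compression, updates, recall tuning):

    import numpy as np
    from sklearn.cluster import KMeans

    class ToyClusterIndex:
        def __init__(self, vectors, n_clusters=16):
            self.vectors = vectors
            # Partition the vectors into clusters; each vector belongs to one centroid.
            self.kmeans = KMeans(n_clusters=n_clusters, n_init="auto").fit(vectors)
            self.assignments = self.kmeans.labels_

        def search(self, query, k=5, n_probe=2):
            # Probe only the closest few clusters instead of scanning everything.
            centroid_dists = np.linalg.norm(self.kmeans.cluster_centers_ - query, axis=1)
            probe = np.argsort(centroid_dists)[:n_probe]
            candidates = np.where(np.isin(self.assignments, probe))[0]
            dists = np.linalg.norm(self.vectors[candidates] - query, axis=1)
            return candidates[np.argsort(dists)[:k]]

    # Example: 10,000 random 64-dimensional vectors.
    index = ToyClusterIndex(np.random.rand(10_000, 64).astype("float32"))
    print(index.search(np.random.rand(64).astype("float32")))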
Kaivalya Abte [1:08:16]:
Right. Awesome. Well, Simon, thanks a lot for joining me today and sharing all
the insights on how things work in turbopuffer. I'll attach the links to the
blog and all the interesting things that you mentioned so people can refer to
them as well. But as you said, building a search application and solving a real
use case is going to be the best use of your time, and learning and reasoning
with ChatGPT or something like that is going to be really powerful. Awesome.
Simon Eskildsen [1:08:51]:
Absolutely. Absolutely. Thank you so much for having me.
Kaivalya Abte [1:08:55]:
Absolutely. Pleasure.