Talia Goldberg [0:00]:
All right, team, let's get started. Welcome to Research to Runtime. This is a
session on building AI systems and agents with our esteemed guests, Ankur, the
founder and CEO of Braintrust, and Simon, the founder and CEO of turbopuffer.
For those that I haven't met, I'm Talia Goldberg. I'm a partner at Bessemer in
our San Francisco office where I lead many of our AI investments across stages.
Bessemer is a global venture firm. We partner with entrepreneurs from the very
earliest days through every stage of growth and are very proud to work with
companies like Perplexity, Anthropic, fal, Abridge, Canva, Shopify, ClickHouse,
and many others, including those that are customers of Braintrust and
turbopuffer. We've seen the power of these products and what they're enabling
firsthand, so we're super excited to share that with the broader group. With me,
I have my colleague Bhavik, who's an ML engineer by training and also works with
me to lead our AI investing. We really started this series because the landscape is a roller coaster of progress, innovation, and best practices, and everyone's trying to figure out what to do, how to build, and which tactics work. It has been awesome to get this community together
to share that and to learn from some of the best. With no further ado, let me
let our guests introduce themselves and maybe give a little background on
yourself and your company. Ankur, why don't we start with you?
Ankur Goyal [1:29]:
Yeah, sounds great. Very excited to be here and chat with you all. I'm Ankur.
Prior to Braintrust, I used to lead the AI team at Figma. Before that, I started
a company called Impira where we did AI document extraction in the stone ages,
pre-ChatGPT, when it was quite hard. At both companies, every time we changed
something—like updated our models, changed our prompts, or changed the
underlying architecture that we used—we would break stuff for customers. We had
to get really good at avoiding that. To do that, we built tools to help us do
evals well. It was really hard to get data to do those evals, so we built
observability tools to help us actually collect data in a way that was useful
for evals. The third time around, that turned into Braintrust. I was reflecting
this morning, and I actually only know two Simons, and I really like both of
them. The other Simon is at Notion, and that Simon was one of the first people
that we talked to. He shared how they did evals at Notion, and we started
working with them—Zapier, Scribe, Instacart, Airtable, and a bunch of other
really great companies that are building AI products. We were doing it really
early, and by working with them, we collaborated and established some really
good workflows around evals and observability that are now the Braintrust
product.
Talia Goldberg [2:47]:
Awesome. And Simon?
Simon Eskildsen [2:50]:
Yeah, I'm Simon. I spent almost 10 years building infrastructure at Shopify. I'm
up here in Canada, and when I joined Shopify, it was doing a couple hundred
requests per second, and by the time I left, we had seen peaks of around a
million. I worked on mainly the things that did not scale, playing whack-a-mole
on all the bottlenecks when the Kardashians rolled through and did some of the
largest flash sales in the world. The fundamental bottleneck for most major SaaS
platforms is the database layer, so I more or less worked on every single aspect
of the database and scaling the compute layer for almost 10 years at Shopify.
When I left, I was bopping around at some of my friends' companies, helping them
with little infrastructure things. At one of them, I discovered the massive cost
of embedding-based search. With this one company, a company called Readwise, we
just did some very simple article recommendations over the course of a month. It
worked pretty well, but this was a bootstrapped company that was spending $3,000 a
month on their Postgres, and putting all of this into actually operationalizing
all these vectors would have cost them $30,000 a month, and they actively did
not do it because it was too expensive. That's what we set out to do with turbopuffer: deliver an order-of-magnitude cost reduction to unlock a lot of the products that people wanted to ship. That's what I work on today.
Talia Goldberg [4:09]:
Amazing. Thanks for sharing that and giving a little bit of the background.
Building off of what you just said, Simon, and just as a little bit of context
for folks, what are the specific decisions in traditional vector databases that
create these cost explosions, and what were the things that you guys did to
really address that? I think it's known or publicly reported that folks like
Cursor have 20x cost reductions by switching to your architecture, and it's not
just cost; there's also latency and speed. What did you do, and what were the
challenges?
Simon Eskildsen [4:44]:
Yeah, I think when you start a company, you generally have some insight or something that you think could be done differently. The insight that led to turbopuffer was that there was a new storage architecture in the air: one where we could use S3 or GCS or Azure Blob Storage as the source of truth, with NVMe SSDs as the cache in front. That only really became possible in the past few years. NVMe SSDs are about 100 times cheaper than memory, but the throughput that you can drive through them is only about five times less. So if you can build a database that takes good advantage of that, you have a real economic advantage. The
second thing that happened was that S3 became consistent at the end of 2020, and
S3 got compare and swap at the end of 2024. This has allowed us, with these
three new principles, to build databases that can have a completely different
storage architecture than those that came before—one where S3 or Google Cloud
Storage are the only source of truth. You don't even need a metadata layer. I
think Ankur and I can both go into a lot of depth on how to use the metadata
layer, and I think we have some different thoughts on how to do it, but this is
a new storage architecture. I think fundamentally, if you want to build a
generational database company, you need two ingredients. The first one is the
new storage architecture because if not, then all the incumbent databases are
just going to add on and eat you alive. But if you have a new storage
architecture, it has fundamentally new economics or new performance
characteristics, and you have a new workload that means that people are out
shopping for a new database—in this case, connecting enormous amounts of data to
LLMs is the general new workload. If you have both of those ingredients and you
have good execution behind it, you have the potential to create a generational
database company. I think that we saw those two things in the air. We talked
about Cursor, Notion, and Linear—some of our customers—and for a customer like
Cursor, it matters a lot what it costs, not because Cursor is cheap, but because
Cursor has to earn a return on all of the data that they have to search over.
That's fundamentally what we're all doing, right? We're earning a return on
whatever is underneath us. If we can change the economics in a way where the
products that people can ship on top of search are fundamentally different, then
our customers can ship more ambitious versions of their product. With Cursor, we
reduced their cost by 95%. That doesn't mean that we're the H&M of search; it
means that suddenly they could index much larger code bases for way more
customers on economics that they could earn a return on. Same with Notion—they
used to have a per-user AI cost, and part of them removing that was switching to
turbopuffer to go from a more traditional storage architecture into one where we
could reduce their spend by millions of dollars a year on this new storage
architecture. It made a lot of sense for them. I think I'm more in the business
of making sure that people can earn a return on top of us with the product that
they build than anything else.
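To make the compare-and-swap primitive concrete: here is a minimal Python sketch of how a database might use it to keep S3 as the sole source of truth, assuming a recent boto3 with S3 conditional writes (the `IfMatch` parameter). The bucket and key names are hypothetical, and this illustrates the primitive only, not how turbopuffer or Brainstore actually implement it.

```python
# Hypothetical sketch: atomically swap a database manifest on S3 using
# conditional writes (S3's compare-and-swap, available since late 2024).
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, KEY = "example-db-bucket", "manifest/current"  # hypothetical names

def cas_update(new_manifest: bytes) -> bool:
    """Replace the manifest only if nobody else changed it since we read it."""
    etag = s3.head_object(Bucket=BUCKET, Key=KEY)["ETag"]
    try:
        # IfMatch makes the write fail with 412 if the object changed under us.
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=new_manifest, IfMatch=etag)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False  # lost the race: re-read the manifest and retry
        raise
```

With this single primitive, writers can contend on one manifest object and the winner's view becomes the new source of truth, with no separate metadata service.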
Talia Goldberg [7:46]:
Yeah, I love how you talk about this very clear moment of "why now," because
it's really hard to build a database company, as you said, and it's really hard
to have these switching costs, and folks get very built in. What do you see for
companies like Notion or others, or even a company that has an existing setup?
How do they switch over, and what is that process like to get going on
turbopuffer?
Simon Eskildsen [8:08]:
There are a couple of profiles, but I think there are two major categories. The first is more net-new workloads—these are either newer companies or ones that only have very simple search and can wholesale replace it with a simple hybrid search inside their app, using normal lexical search together with vector embeddings and fusing the results. These are more net new, where it's very simple for them to adopt. They're generally newer adopters: they start on turbopuffer and go from there. The other type of customer
that we see are ones where they have existing lexical search systems, typically
Lucene-based search systems that are tuned for lexical. They see that they can
boost their search recall and precision by 10-20% by incorporating embeddings,
which is huge, right? Some of these companies will have people that spend an
entire year improving search results by one or two percent because that really
ends up mattering on average to their customers. So the 10-20% that we've heard from some of our customers is enormous, especially when machines are starting to use that same search to find and research things, all these new baseline SaaS features that we're starting to expect. Those companies generally adopt us alongside their existing lexical search and then over time shift more and more search workloads over to turbopuffer, the net-new ones within the business but also existing ones, querying both in the meantime.
Over time, you would like them to move everything to turbopuffer.
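The "fusing it together" step Simon describes is often done with reciprocal rank fusion. Here is a toy sketch; the two input lists are assumed to come from a lexical index and a vector index respectively, and turbopuffer's actual hybrid-search API may differ.

```python
# Toy reciprocal rank fusion (RRF): merge two ranked result lists.
from collections import defaultdict

def rrf(lexical_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in (lexical_ids, vector_ids):
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)  # higher rank, bigger share
    return sorted(scores, key=scores.get, reverse=True)

# rrf(["a", "b", "c"], ["c", "d"]) ranks "c" first: it appears in both lists.
```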
Bhavik Nagda [9:43]:
Ankur, are you seeing that companies are also benchmarking, call it, the latency and recall of the vector systems they're using, inside your product? What's the entry point to Braintrust?
Ankur Goyal [9:50]:
Yeah, so I actually haven't seen people run performance benchmarks, like speed
benchmarks, of different vector databases in their evals. I think the reason is
that you don't really need evals to do that. If you have a set of cases or the
shape of data that you want to test the performance on, you can test it once or
twice and get a reasonable measure. But if the words inside of a particular
paragraph of text change, it's unlikely to affect the speed at which a vector
database returns stuff. We do see a lot of people actually eval the quality of
search results that they get, and we do see a lot of people find silly
concurrency problems where they're serializing vector database calls or
something in their waterfall of a trace. There are quite a few turbopuffer
API calls that get traced in Braintrust one way or another, but honestly, I
think most people that we work with that use turbopuffer, when I ask them about
it, they say the decision was pretty straightforward and easy.
Bhavik Nagda [10:55]:
It's helpful. And even just taking a step back, if I was building an AI agent
net new, what's the best time to start creating an eval harness or starting to
do tracing, production monitoring, all this stuff that Braintrust provides?
Ankur Goyal [11:07]:
Yeah, I would actually say it's on day zero. When we started, we talked to a
bunch of companies and we found that the people that were most interested in
Braintrust were people that had shipped their product three months ago, because people tend to have a bit of confirmation bias leading up to shipping a product that they've played with. Let's say you and I are working on
a product. We might sit in a corner and play with it until we feel like it's
good, and then we're like, "We don't need evals; we were able to improve this
thing." Then you ship it, and then Talia starts using it, and she's like, "Wow,
this thing sucks." You're like, "What? We were sitting in a corner, and it was
fine." No, it actually sucks. You're like, "Why does it suck?" And she's like,
"Why do I have to tell you?" That is usually the point at which people realize
that they need evals. However, people like that, once they start doing evals, realize that if you're not doing evals while you're actually speccing out or thinking about a project, then you're going to waste a lot of time. Usually, after people
get bitten by the eval bug, they start doing evals as they're actually
prototyping things. A large part of that is if you use evals from day zero, you
can build a prototype that kind of sucks, but you have a feedback loop built
into your development process that lets you improve the quality of the product
very quickly. If we were using the metaphor of you and I sitting in a corner,
we'd have to sit in the corner for a lot less time to be able to ship something
than if we weren't using evals.
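For a sense of what "evals from day zero" looks like in practice, here is a minimal eval in the shape of the Braintrust SDK's quickstart; the project name, data, task, and scorer are stand-ins for your own, so treat it as a sketch and check the current docs.

```python
# Minimal day-zero eval: a dataset, a task, and a scorer.
from braintrust import Eval
from autoevals import Levenshtein

Eval(
    "my-agent",  # hypothetical project name
    data=lambda: [{"input": "Foo", "expected": "Hi Foo"}],  # seed cases
    task=lambda input: "Hi " + input,  # your prototype goes here
    scores=[Levenshtein],  # swap in scorers that match your PRD
)
```

Even a handful of seed cases gives the feedback loop Ankur describes: you can change the prompt or model and immediately see whether quality moved.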
Bhavik Nagda [12:34]:
That sounds cool. I just wanted to flag for the group, if anyone has any open
questions, please do add them into Zoom, and we'll make sure to answer them as
we go.
Ankur Goyal [12:43]:
I wanted to add to what Simon was saying. You mentioned there are two things
have to change quite a bit for a new database to be relevant. At Braintrust,
even though we're not selling a database, we actually build our own database
called Brainstore. Simon and I talk about this all the time. It's also
built to run directly on object storage, and it literally wasn't possible until
S3 released compare and swap, for example. A lot of that stuff is very relevant,
but I actually think there's one more thing that's quite different in AI, and I
think this is especially true for us—maybe it's a little bit less true for
search—but in traditional observability, Prometheus-flavored observability, the
information that you track tends not to be tied to PII. The logs that are spit
out of your web server or the CPU metrics that you're tracking from all of your
containers or whatever, if that leaks, it's not the end of the world. It's
obviously not good, but it's not the end of the world. If your web server logs were spitting out PII (say, if Simon discovered someone doing that at Shopify), that would be considered a pretty serious bug and fixed quite quickly. In AI, the interesting thing is
that what you're observing is actually people's raw interactions with LLMs, and
the information that you're observing actually inherently has a lot of PII in
it. The scrutiny that people have applied to Braintrust from a security
standpoint is an order of magnitude higher than the scrutiny that was applied to
observability products in the previous generation. That's one of the other
benefits of using object storage. We make it very easy for customers—we have a
Fortune 10 customer, for example, using Braintrust. We make it super easy for
them to run Braintrust inside of their own cloud environment. Part of the reason
that's practical for them to do is that Brainstore stores all of its state
directly on object storage. I think that is something that is very different
than the previous generation of observability systems.
Simon Eskildsen [14:54]:
I think there's actually a reason number three and a reason number four in the
"why now" of the databases, and you're touching on number three. I just usually
simplify it into one and two. The way that I often talk about three is that
there's a new deployment model. In 2015, I was working on migrating all of
Shopify to the cloud. A lot of the frontier companies didn't go into the cloud
until the late 2010s. Now everyone is running in the cloud; even most of the
enterprise is living in the cloud. It makes it a lot easier to ship things into
people's VPCs without ending up as an on-prem company very early, which just turns into building a support team very quickly. I think the new deployment
model is one, and then I think that I would go a step further here and also say
that this actually allows even more deployment models than just BYOC, where this
is easy operationally inside of another customer's cloud. I think this is one of
the most underappreciated things about this architecture, but also one of the
most important that we see when we're out selling, which is that with this
architecture, if you guarantee that everything is in object storage, it means
that you can use your customer's key to encrypt every single byte of data.
turbopuffer exposes this. You can send an encryption key that's managed inside of your cloud to turbopuffer, and the data is then logically the same, even though it lives in our bucket, as if it lived in your bucket. This is fantastic for enterprise
because it means that IT gets all the governance controls that they need to have
to be able to shut down any data access, but without any of the operational pain
of running it even in their own cloud, where you want to lock it down as much as
possible. They can go a step further. A lot of our customers in SaaS do this, where they use their own customer's key to encrypt the data in the turbopuffer bucket. Not even our customer is able to see that key; it's passed all the way through, and only their end customer has access to it. We also see some of our more advanced customers, in their BYOC deployments (which is probably very similar to what you're doing, Ankur), store all of the data inside their own bucket, and turbopuffer just has access to a particular prefix inside that bucket. All of the data is in there, and it gives the customer this warm, fuzzy feeling of knowing that all of the data is in their control, either via the encryption key or, in the BYOC case, because it's literally in their bucket.
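The key-passing idea reduces to standard envelope-style encryption. Here is an illustrative Python sketch with a raw AES-GCM key; in reality the customer's key would live in their cloud KMS and never appear as raw bytes like this, so this shows only the logical shape, not turbopuffer's actual CMEK flow.

```python
# Illustrative only: every byte written to the bucket is encrypted under a
# key the customer controls; losing access to the key revokes the data.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_for_bucket(customer_key: bytes, plaintext: bytes) -> bytes:
    nonce = os.urandom(12)  # standard AES-GCM nonce size
    return nonce + AESGCM(customer_key).encrypt(nonce, plaintext, None)

def decrypt_from_bucket(customer_key: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(customer_key).decrypt(nonce, ciphertext, None)

# customer_key = AESGCM.generate_key(bit_length=256)  # lives in customer KMS
```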
Ankur Goyal [17:08]:
When it's the customer-managed encryption key, do you search over the encrypted
data directly, or do you decrypt it in memory and then search?
Simon Eskildsen [17:17]:
We use the key to decrypt it into memory, and then you have to accept some TTLs on the disk cache and things like that.
Ankur Goyal [17:20]:
There's no homomorphic vector encryption yet.
Simon Eskildsen [17:24]:
That stuff doesn't really help you very much. It just makes it more complicated to break into one namespace, but you can still do it. You can do these rotations; it just makes an attack take longer. It doesn't fundamentally change the shape of it. The fourth thing that is very nice if you're building a database company right now is that you get a lot of data very quickly.
Bhavik Nagda [17:55]:
A lot of traditional database systems have been built on NVMe, non-volatile memory express, and they haven't separated compute and storage in the same way. Can either
Ankur or Simon, whoever wants to jump in, just outline the sort of broad scope
evolution that we're seeing before we dive into specifics around databases?
Ankur Goyal [18:15]:
Yeah, there's not a lot of database systems that are actually built on NVMe. The
problem is that if you rewind two or three generations ago—like back when I was
working on databases at MemSQL in the ancient days—people were just starting to
get access to SSDs regularly on hardware, like on-prem hardware. Maybe back in those days, Simon, you were starting to get SSDs and Fusion-io in the data center.
Simon Eskildsen [18:38]:
Mm-hmm.
Ankur Goyal [18:39]:
Fusion-io, that's good stuff. But most people couldn't even afford that. Some database systems tried to build fancy algorithmic support for these SSDs, but pretty much no one did, because SSDs are really bad at random writes, and
that is the main problem when you're building an OLTP system. Now fast forward a
little bit, and then people got infatuated with the cloud. For a long time,
there was no SSD support in the cloud. When it finally came, as these NVMe disks
arrived, they were not durable beyond the lifetime of an instance. All of the
stuff that people were working on before assumed that the NVMe would survive and
be useful as long-term durable storage, and that is still not the case today. I
think that's a key difference. I don't know of any commercial OLTP system that
is actually built with the assumption that it can use volatile NVMe.
Non-volatile is not the right word; non-durable NVMe. I think that's the big
difference.
Simon Eskildsen [19:45]:
PlanetScale is, but...
Ankur Goyal [19:48]:
There we go.
Simon Eskildsen [19:48]:
They solved it from the...
Ankur Goyal [19:49]:
But I would consider them in the latest generation.
Simon Eskildsen [19:53]:
Sure, I think there's also two parts of it, right? There's the operational side,
which PlanetScale is obviously exceptional at. Their Kubernetes operator can
handle these circumstances because what you can end up with is like three NVMe
instances are gone, and then all your data is gone. You have to have an enormous
amount of trust in your operations to be able to get onto these NVMe instances.
We can talk about the software side for a second. Ankur kind of alluded to the
random write piece, which makes OLTP incredibly difficult, at least any
B-tree-based OLTP, which Postgres and MySQL are. You need an LSM, and those are only now starting to mature. I do think that the storage engine that
takes advantage of NVMe is also fundamentally different than the ones that have
been written in the past. MySQL and Postgres were both written for HDDs so many
generations ago, and even SSDs are different, and then NVMe SSDs are different
again in the trade-offs you make with memory. The nice thing is that the storage
architecture that is required to take full advantage of object storage and NVMe is more or less the same, because what you need to do to take advantage of NVMe on
the read path is that you need to do a lot of outstanding concurrent requests to
the disk in as few round trips as possible because you can't escape the, say,
100 microseconds of random read latency that you have to an NVMe SSD. In the
same way that every time you go to S3, you have a P90 access time of around
100-200 milliseconds, depending on the region and so on. You can't escape that,
and it's sort of fundamentally the same problem where you can max out the
network NIC to S3, and you can max out the NVMe port to the disk, but you have
to do an enormous number of parallel requests in every round trip and minimize the number of round trips. The storage engines built in the '90s, the 2000s, or even the 2010s were not built with that kind of round-trip sensitivity in mind. S3 really forced our hand in making sure that we only do
three round trips for everything. The storage engine is capable of doing good
cold read latencies on S3 by minimizing round trips and maximizing concurrency.
That just also happens to be phenomenal for a disk.
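The read pattern Simon describes, many requests in flight per round trip, looks roughly like the following sketch, with hypothetical bucket and key names. The point is that the total latency of a "wave" is about one S3 P90, not one per read.

```python
# Sketch: issue many byte-range GETs against one S3 object concurrently.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "example-db-bucket", "segments/000042.data"  # hypothetical

def read_range(offset: int, length: int) -> bytes:
    resp = s3.get_object(
        Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{offset + length - 1}"
    )
    return resp["Body"].read()

def read_wave(ranges: list[tuple[int, int]]) -> list[bytes]:
    # All requests in flight at once: one round trip's worth of latency.
    with ThreadPoolExecutor(max_workers=64) as pool:
        return list(pool.map(lambda r: read_range(*r), ranges))
```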
Bhavik Nagda [22:44]:
No, that's helpful. Just to understand that better, since you're now using
object storage and you want to achieve some degree of data locality, Simon, my
understanding is that turbopuffer is best fit for customers that have natural
sharding in their data. Can you talk about that a bit?
Simon Eskildsen [23:04]:
Yeah, I think that a startup's only moat is focus. Our focus in the beginning was any large multi-tenancy workload where the P100 tenant was not particularly large, so the individual shards did not need to be that large. That's not the case
anymore. Now we can do shards that are into terabytes without any issues. But in
the beginning, you're right. We took advantage of the fact that the largest code
base in the world, I don't know, something like LLVM or Linux or something like
that, is still not that large. Even for Notion with their workspaces, the
largest Notion customer was not that large. So we focused on getting very good
at handling a very large amount of shards. Now the plan has always been to get
very good at handling very large shards because then you have other customers
that have extremely large shards. Maybe their biggest customer has a billion or
maybe even tens of billions of documents they want to search simultaneously, and
then you want to get good at that. Every database shards; it's just who manages
for you and when does it happen.
Ankur Goyal [24:05]:
Right.
Simon Eskildsen [24:05]:
It's very funny.
Ankur Goyal [24:05]:
We have the exact opposite problem. Please keep going, but yeah.
Simon Eskildsen [24:11]:
What we're working on now are namespaces that are around half a terabyte, so
that works fine now in turbopuffer. You want to max out at probably half a
terabyte to a terabyte of shard sizes, but you want to make these shards as
large as possible because, fundamentally, when you're doing search or any type of database lookup over m shards of n documents each, the cost is like m times log n (it's not actually log n for a vector lookup, but let's pretend it is). You want n to be as high as possible, because the cost only grows logarithmically with it, and you want m to be as small as possible, to do as few of those operations as possible. For reference, in Elasticsearch, when I ran that in production at Shopify, you go for a shard size of somewhere between 30 and 50 gigabytes—an incredibly small size. The m in m times log n is very large, which means you're spending an enormous amount of CPU cycles doing that search. Whereas in turbopuffer, our shard sizes are trying to
get up to a terabyte, right? Successfully, and then at some point, you have to
spread that out into multiple machines and build a sharding management layer on
top of that, which will be there in future versions of turbopuffer.
Fundamentally, you can just do ID modulo n, which is what a lot of our
customers are doing. The short answer is yes, in the first version of
turbopuffer, we only supported small shards, but now we can do state-of-the-art
shard sizes.
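The "ID modulo n" routing Simon mentions is simple enough to sketch; the namespace naming below is hypothetical, and queries fan out to all shards and merge the top-k client-side.

```python
# Sketch of client-side ID-modulo-n sharding across namespaces.
import hashlib

NUM_SHARDS = 8  # chosen so each shard stays under ~0.5-1 TB

def shard_for(doc_id: str) -> str:
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return f"my-index-shard-{h % NUM_SHARDS}"  # hypothetical naming scheme

# Writes go to shard_for(doc_id); a query hits all NUM_SHARDS namespaces in
# parallel and merges the per-shard top-k results client-side.
```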
Ankur Goyal [25:30]:
Yeah, it's quite interesting because, if you use one of those customers
like Notion as an example, all of the logs across their users are coming into
one Braintrust collection, and the task is actually to look at the information
across all of them. We don't have sharding based on a collection, but we do many
of the same tricks by using time as a partitioning key.
Talia Goldberg [25:58]:
With that, I think it would be awesome if we could go to a brief demo from both
of you. Maybe we'll start with you, Ankur, and talking through a bit of a
product demo. I think one thing that we heard as folks were signing up for this
session is that there are some questions around how to structure evaluations for
multi-step agent workflows where one failure will cascade through an entire
process—just things around best practices. As you go through and do the demo,
you can just talk a little bit about both of those things.
Ankur Goyal [26:32]:
Yes, I'm very happy to. Give me just one second.
Talia Goldberg [26:34]:
We'll see if the demo gods are with us.
Ankur Goyal [26:36]:
I think they are. I just want to pull up the right project. Okay, great. Great.
So we have an agent built into Braintrust. Before I talk about evaluating
agents, I'll just show you a little bit about that agent because then we're
going to look at the logs for that agent and a little bit about how we evaluate
it. Here is an example of a question I ran. I can do a fresh one: "Who uses this
feature the most?" Similar to things like Cursor or Quadcode, this uses an LLM
with a bunch of tools available to it, in this case to analyze the data in my
logs. It will run a bunch of searches over the logs, which take advantage of a
lot of the things Simon was talking about. For example, in Braintrust, if you
run a search and we read from object storage, it might be slow the first time,
but then all that data is cached in NVMe, and the second search is much faster.
This thing will take advantage of that, and it's going to try to run a bunch of
searches and then figure out who uses the product the most. Now behind the
scenes as that runs, it's actually generating traces that look like this. This
is like a classic agentic trace in something like Braintrust. Here's a system
prompt with some instructions and then the user's question and then a bunch of
tool calls, and you can see them interleaved. If you can understand the timing
of each of these calls, you can also do stuff like understand the conversation
in a more chat-like experience. You can see, "Okay, to that," and this is a lot
of the debugging that people do when they're actually playing around with
something like Braintrust. Now, the other thing you can do that is quite cool
and something Simon helped me figure out when we were first working on
Braintrust is search. I can search for something like my email address, and
you'll see that it will complete very quickly, and it will also actually update
all of these aggregates as well. That is again something that takes advantage of
some of those properties Simon was talking about. A lot of the data to calculate
one aggregate is similar to the data that calculates another aggregate, so once
it arrives on NVMe, it's quite easy to actually run a bunch of redundant or
similar calculations over the data very quickly. I think again it allows us to
build a really cool user experience. The last thing I'll show you is how this
actually translates to evals. I have a project, and this is a little bit stale;
I should probably rerun this. When we were first shipping this feature, we
actually ran a bunch of model comparisons. This is a pretty popular thing that
people do in Braintrust. They take use cases like the one that I just showed
you. If we click into this, you'll see that this experiment actually looks very
similar—like these logs look very similar to what we were looking at before. You
can do tracing in production, but you can also do tracing when you're evaluating
to get all the same debuggability. If we go here, you'll see we can actually
evaluate different models side by side and understand the trade-offs and stuff
between them. This is the kind of stuff that I think is really powerful. There's
a lot we can talk about with questions if people are curious about how to
specifically evaluate agents. The one thing I'll just quickly call out is I
think it's super important, in addition to running a bunch of really good
scores, to also track a bunch of metrics that help you understand the
relationship between LLMs and tools. The most useful thing I find is looking at
stuff like tool errors and trying to see if I use a different model or if I
change the prompt or something about how it runs, do I suddenly get a lot more
latency in my LLM calls and I suddenly get a lot more tool errors—stuff like
that.
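The tool-error metric Ankur calls out is easy to compute from traces. The span shape below ({"type": ..., "error": ...}) is a made-up stand-in for whatever your tracing format actually records, so treat this as a sketch.

```python
# Sketch: per-trace tool-error rate, to compare across models or prompts.
def tool_error_rate(spans: list[dict]) -> float:
    tool_spans = [s for s in spans if s.get("type") == "tool"]
    if not tool_spans:
        return 0.0
    errors = sum(1 for s in tool_spans if s.get("error") is not None)
    return errors / len(tool_spans)

# Log this alongside quality scores in each experiment: a model swap that
# quietly doubles failed tool calls shows up immediately.
```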
Bhavik Nagda [30:16]:
Ankur, if I have a production system set up and I come in and say I can see the
logs and 95% of them are great, but 5% of them are wrong, how should I start to
prioritize what to work on or what to focus on?
Ankur Goyal [30:31]:
Yeah, I think one super simple thing you can do is capture user feedback, and
Braintrust allows you to do that quite easily. You'll see here there are some
score columns, and you can do stuff like filter. Let's see. Sorry, I think I
need to refresh the page. Yeah, we can filter and say, "Okay, I want to find all
of the examples where there's feedback." This is a pretty quick and simple way
to just surface things that may be interesting. If you are early in your
application development lifecycle, you probably don't have that many interesting
things. Right here, I have six, so I could probably look at all six of these and
look at the thread view and try to understand what actually happened here and
why was this a good or bad experience and then turn that into test cases to use
in an eval. If you find that you have too many, then you really have two
options. The first thing that people think about, which I think is a cool thing
to do but not often that useful, is to try to come up with fancier scoring
mechanisms. Maybe you have a thumbs up and thumbs down, and you have 10,000 of
those, and you come up with another scoring method, for example, using an LLM to
look at the outputs in addition to that and filter that down further. The simple
thing to do, which is what I honestly recommend in a lot of cases, is just look
at the first 10 of those 10,000, and you'll probably find something interesting.
Every time you go to the logs page, if you find one interesting novel case, then
it's ROI positive. You don't really need to think about it or overthink it too
much. As long as it is relatively time-efficient for you to find novel
interesting things that you haven't seen before, then things are good. If it's
not, then you should invest in more scoring to help you narrow things down
further.
Bhavik Nagda [32:30]:
Are there any common dark patterns or common mistakes that you see people make
either setting up eval harnesses or trying to close this loop?
Ankur Goyal [32:39]:
Yeah, I think a few things. The first is that people will only do online evals
or only do offline evals. I think you should think of your job building AI software as building a feedback loop, because you can't predict what a prompt is
going to output. You need a feedback loop from what people actually experience
to what you're developing. To build an effective feedback loop, you need both
offline evals and online evals. Some people don't do that. Another thing that we
see people do is only trace LLM calls. It's very easy to do that; you can use
our libraries or there's some proxies and stuff that will let you do that. You
get some visibility into what's happening by just looking at the LLM calls, but
you get significantly more if you can capture the interleaving execution of LLM
calls and tool calls. I think it's very important to put in a little bit of work
and get traces that reflect the information that you actually want to see. The
third thing is if you're very lazy about scorers, the anti-pattern that we tend
to hear is, "Do you have a pre-built scorer for hallucination?" That's probably
one of the most common questions that we get. If you're lazy about that, then
you're just not going to get good results. The reason is that I think scoring is
essentially like the AI evolution of writing a PRD. It's your opportunity to
come up with the criteria that are very specific to your use case that, if met,
will result in a good user experience. If you're lazy about that, just like
you're lazy about writing a spec or writing a PRD, you're going to end up with a
crappy user experience that regresses to the mean. On the other hand, if you use
it as an opportunity to create differentiation for your product and really think
about how to capture the attributes of an experience that you think represent a
good outcome for the user, then you'll get a really good experience. We really
encourage people to create and customize their own scoring functions to try to
find stuff that is relevant to them.
Bhavik Nagda [34:40]:
What's a good example of that?
Ankur Goyal [34:42]:
So we have two customers that are very different—Vercel and Stripe.
Hallucination for Vercel means something very different than a hallucination for
Stripe. If you try to use the same logic to figure out whether something is a
hallucination, I think you might, in the case of Stripe, miss things where it's actually really important that you don't make anything up. In the case of Vercel, you
might not allow the model to be as creative or open-minded, if you will, about
the code it generates to help someone achieve what they're trying to do.
Actually trying to think about what that means in the context of Vercel, for
example, you can statically analyze code, and if it references a library that is
not imported or doesn't exist, that is a very verifiable form of hallucination
that is just completely irrelevant for a customer support bot at a financial company.
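The Vercel-style check Ankur sketches is verifiable with plain static analysis. Here is one possible version for Python codegen; the function name and the "missing import means hallucination" policy are this sketch's assumptions, not Braintrust's scorer.

```python
# Sketch: flag imports in generated Python that don't resolve locally,
# a verifiable hallucination signal for codegen.
import ast
import importlib.util

def hallucinated_imports(generated_code: str) -> list[str]:
    missing = []
    for node in ast.walk(ast.parse(generated_code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            if importlib.util.find_spec(name.split(".")[0]) is None:
                missing.append(name)
    return missing

# hallucinated_imports("import numpy\nimport not_a_real_pkg")
# -> ["not_a_real_pkg"] (assuming numpy is installed)
```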
Bhavik Nagda [35:32]:
That's helpful. Maybe one more question, and then we can chat a bit about the
demo for turbopuffer. One of the biggest gains of the last year or so with
regards to these agents has been tool calling and their integrations with
third-party existing systems. Now imagine when you're calling tools with an
agent, you could get back an image, you get back a PDF, you might get back text,
increasingly maybe videos and multimedia. How far do you anticipate Braintrust
will go in terms of storing that and creating that trace?
Ankur Goyal [36:04]:
Oh yeah, we already support that. We have a feature called attachments. The way
to think about attachments is like in a traditional database, if you have a VAR
char or a variable string, it doesn't store it directly in the B-tree; it stores
a pointer to it, and that lives in a cheaper and simpler form of storage with
some performance trade-offs. We do exactly the same thing for multimedia. You
can upload attachments to Braintrust, and they get stored directly on object
storage as blobs and referenced inside of the data that's actually indexed. It's
very cheap to do that as a result, and the trade-off is that you can't search
the raw base64 content in an inverted index, which no one cares about. It actually
works quite well.
Simon Eskildsen [36:48]:
That's super helpful. And you mentioned you use turbopuffer in search across the Braintrust docs within the platform natively?
Ankur Goyal [36:55]:
So we are about to release a feature that uses turbopuffer, and that feature
lets you search the Braintrust documentation automatically. The word doc is
probably the most overloaded term that Simon and I could use in this
conversation because it means many different things. We unfortunately don't
currently use turbopuffer to power the search experience and stuff inside of
Braintrust, but Simon and I are always scheming about interesting ways we can
collaborate.
Bhavik Nagda [37:28]:
Well done. Awesome. Maybe with that, I know Simon, you wanted to introduce the
demo, turbogrep, and show the capabilities of turbopuffer.
Simon Eskildsen [37:34]:
Yeah, for sure. Yeah, it's a database, so I don't have a nice UI. Actually, I do
have a UI. I can show the UI and then I'll show you turbogrep. Unfortunately,
we are a lot better at Rust than we are at React, so bear with me. Maybe if
someone in attendance wants to come help us with the dashboard, they should come
join us. This is what turbopuffer looks like in the dashboard. It's very simple.
This is our test account. You can see here it has almost 5 billion documents,
and the invoice is only $700. Now, we don't charge ourselves, but it gives you
an idea of the kind of scale you can get to with good economics on turbopuffer.
Very simply, it just shows you what's going on. You can see the namespaces that
you have here. When we test, we just create a namespace for every single one of
the unit tests that we run, so there's a lot of them. I think there are tens of
thousands. Some of our customers have millions, but it's a very simple UI here.
I will now show you turbogrep. Turbogrep was just something that I've been
hacking on when I have time. I think Ankur has more time to code than me. I
don't know how; I don't think Ankur sleeps very much.
Ankur Goyal [38:46]:
I can tell you my story after this.
Simon Eskildsen [38:48]:
Yeah, please. Turbogrep is basically ripgrep, aspirationally, in the limit, one day. Instead of searching with regexes and full-text search, it embeds
the entire code base and then searches over the embeddings. What that allows you
to do is you can do a search like "Upsert to Vector Store," and it will return
the actual function that's doing that—in this case, "write batch." Inside of my editor, I can do "upsert to vector store," right? The query doesn't mention that it's turbopuffer, or that the function is called "write batch" rather than "upsert," and it just finds the function. I use this all the
time because I honestly read a lot more code than I write code, and being able
to do a semantic search like this is extremely useful. You could also do
something like "How does it do a fast search while indexing?" to try to lead you
to the right thing, and it leads me to this function called "speculate search," because it's creating the embedding itself while it's chunking the whole code base to keep it up to date. I can go read it from there as I map my own vocabulary
onto it. I could do something like ASCII emoji for puffer fish, and it will take
me to the progress bar here, which is like an ASCII puffer face that's inflating
and deflating as you're doing things. It's a very simple demo, but a lot of our
customers do this sort of fundamentally connecting data to AI, whether it's code
or documents or PDFs or some kind of unstructured data. You can see here if we
reset this namespace, it will chunk the whole code base, find the closest region (I'm currently closest to the Toronto region for turbopuffer), and then create an embedding. This demo right now uses Voyage to create the embedding, and then it does the tpuf search, and it took about 25 milliseconds, once everything was written in, to serve the search result. If you do this, it will chunk
the whole code base again and then rerun it, so it's very simple. We wanted to
have something simple out there for our customers to use, and also it's a very
useful tool for me. This will work also on much larger code bases. Actually,
maybe we can use that. Do I have Rails or something? I don't know if this is
going to work. Split a string. I don't know. Let's see here. Let me do this for
both. You can see here it's chunking the entire code base, and then there are
53,000 chunks, and it starts to just pump these into whatever H100s Voyage is
running to create the embeddings. It's a bit slow because I'm using a pretty
slow model, and I don't have all the rate limits. turbopuffer can do up to 10,000 or more writes per second, so this is currently limited by the
Voyage model or my uplink, one of the two. It's just going through the code
base, and eventually, once this finishes in about another minute, it will return
to query. But yeah, that's it. That's my little demo.
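Turbogrep's core loop (chunk, embed, search) fits in a few lines. The sketch below keeps the vectors in an in-memory numpy matrix so it stays self-contained; in the real tool the upsert and query go to turbopuffer, and `embed` stands in for whatever embedding model you call (Voyage, in Simon's demo).

```python
# Miniature turbogrep: chunk a repo, embed chunks, cosine-search a query.
from pathlib import Path
import numpy as np

def chunk_file(path: Path, lines_per_chunk: int = 40) -> list[str]:
    lines = path.read_text(errors="ignore").splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]

def build_index(repo: Path, embed) -> tuple[np.ndarray, list[str]]:
    chunks = [c for f in repo.rglob("*.rs") for c in chunk_file(f)]
    vectors = np.array([embed(c) for c in chunks], dtype=np.float32)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-normalize
    return vectors, chunks

def search(query: str, vectors: np.ndarray, chunks: list[str], embed, k: int = 5):
    q = np.asarray(embed(query), dtype=np.float32)
    q /= np.linalg.norm(q)
    top = np.argsort(vectors @ q)[::-1][:k]  # cosine similarity, best first
    return [chunks[i] for i in top]
```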
Bhavik Nagda [41:36]:
Yeah, Simon, once you've chosen turbopuffer for your performant database system,
there are a lot of levers you have—the chunk size, the embedding model, the
re-ranking model—to improve the search quality. Where do you recommend people
start, and what has the sort of highest effort-to-payoff ratio?
Simon Eskildsen [41:54]:
My general answer here is find an embedding model that's fast because we've seen
customers who have embedded tens of billions of documents with an embedding
model, and then they get great throughput—like more than the 500 a second here
I'm getting on my little probably trial account with Voyage—and they pay
potentially hundreds of thousands of dollars to embed all of their billions and
billions of documents, and then the query latency is 300 milliseconds. You want
to choose a fast embedding model, and that can vary a lot because you might run
all of your workloads in Frankfurt, and whatever embedding provider you're using might not give you any choice over the region and route it all to Oregon. Now, no matter what, you're paying a 200-millisecond penalty to get that embedding unless you have more control. Getting that finally finished just takes a lot of time. We recommend spending a little bit of time on it. You can use any code agent to
very quickly just find the one that's fastest for you and run it. The other
thing is that we recommend that people start with something very simple. Just
start with a simple embedding-based search on a small use case and then get it
in production. Don't try to do anything fancy until you have the evals in place
because if you start to do something complicated before you have the evals, you
end up in this space where you build a complicated system, but you have no idea
where in the complexity the value is. You want to be able to do that to start
building your intuition. What I generally say is stand up the simplest possible
thing that passes vibes and then try to get the evals going as quickly as
possible. Then it's about creating the evals: you have to create a bunch of evals yourself, then the people on your team have to start creating evals, and you should probably also recruit your cousin to create evals; you need a lot of evals. When I was working on search at
Shopify, we spun up massive teams of people that just sat and created these evals. Today, we have other means of doing that, but it's a very traditional way
of doing it, and everyone knows that's the best way to get good search results.
Then you can start layering on complexity, right? You can do a query rewriting
layer where you start to rewrite queries. You can use lexical search. You can
use multiple embedding searches as part of the same one. You can start to use
late chunking, late interaction. You can train your own re-ranker. You can try
different re-ranking models, but do not introduce complexity into the search
pipeline until you have the evals in place because otherwise, you will get
addicted to complexity that you have no idea if it adds any value.
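Acting on "find an embedding model that's fast" can be as simple as measuring the query-time latency you would actually pay. In this sketch, `embed_fn` stands in for whichever provider client you're evaluating.

```python
# Sketch: rough p90 latency of a query-time embedding call, in milliseconds.
import statistics
import time

def p90_embed_latency_ms(embed_fn, query: str = "upsert to vector store",
                         trials: int = 20) -> float:
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        embed_fn(query)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.quantiles(samples, n=10)[-1]  # ~p90

# Run this per model and per region: a cross-region 200 ms penalty shows up
# immediately, before you've embedded billions of documents.
```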
Bhavik Nagda [44:13]:
That's helpful. The other trend we've seen is that as we move to these agents, multi-turn LLM systems doing searches, an agent might first try a semantic search, then try the traditional search, and it seems like turbopuffer offers hybrid search for those types of use cases.
Simon Eskildsen [44:37]:
Yeah, that's right. I think we find that there is—let's say that an embedding
model doesn't know what turbopuffer is. It might start returning like "fast
fish." I don't know if like a sailfish is the fastest fish, and that's not
helpful, right? Because it doesn't know what it actually is. Whereas lexical search will match it exactly, because it has no idea what it means; it just sees a string. Embedding models are very good at turning strings into
things, but if it doesn't know what the thing is—which is often the case in,
say, you have a Notion workspace and you're using some internal wording that the
embedding model could never know what actually means—then the lexical search
comes in really handy. It's great to use both, and we find that most of our
customers have a lot of success with that, especially if you're powering some kind of command-K search, right? Where you want to search for "si," and the embedding model is going to think that that's "sí," something agreeable, Spanish for "yes," but really what you want is just to find a document that starts with "Simon," right? These things have to play
in concert. Generally, for search applications, it's pretty idiosyncratic what
works well. We try to give opinions about what we've seen work well on average,
but you need to have the evals in place to go from there.
Talia Goldberg [45:51]:
This is awesome. By the way, thank you. I love that you both are
basically selling each other's products throughout this entire process. It goes
hand in hand.
Ankur Goyal [46:02]:
I think we've seen a lot of customers actually be successful with both products
together, which is pretty cool and not a super common thing, at least in my
experience.
Talia Goldberg [46:13]:
We've seen the same, which is why we were so excited to have both of you on
together as part of this because we knew that it would lead to great back and
forth and that there was so much overlap from just what we've seen even across
our portfolio and customers. I know we only have a couple of minutes left. I
would love to wrap on just hearing a little bit from both of you on your own
stacks—like your own setup, whether it's your own coding setup or tools
internally that you all use—anything that you can share that you think would be
interesting. To the broader group, we found a lot of the show and tell to
provide a lot of value, and y'all are experts and have good noses for what's the
latest and greatest at the cutting edge, so would love to hear that.
Ankur Goyal [46:55]:
I can go first. I think the main thing that has really changed for me is moving
from synchronous programming, which I also don't have the bandwidth to do a
whole lot of right now, except between the hours of 9 p.m. and 12 a.m., to
asynchronous programming. I think asynchronous programming is quite powerful,
and I like to think of it this way: for people that are professional programmers, it's what vibe coding is for people that aren't programming professionally. Asynchronous programming is basically working with a really
powerful coding agent or multiple coding agents at once to execute something
that is very difficult. It requires thinking a lot about testing and evals, for
example, so that you can let the agent iterate really effectively. It requires
reading a lot of code. It requires thinking about scope. You can burn out an
agent very quickly by giving it too much scope, but if you figure out the right
abstraction to let it work in, it can be very effective. I've spent a lot of
time over the past, let's say, six months or so trying to just personally hone
that skill. I don't think I'm like the number one world's best asynchronous
programmer, but I think I've improved quite a bit, and I've seen some people who
are extraordinarily good at it, so I'm just generally quite excited about this
trend. It'll be interesting to see how truly productive great programmers who
embrace it can be.
Talia Goldberg [48:22]:
We've seen some pretty wild demos and things on Twitter from people that have really embraced this, even orienting their sleep schedules around their agents.
Simon Eskildsen [48:33]:
This is the return of the Uberman sleep schedule where you'll sleep one hour and
then you're up for an hour and a half or whatever it is. No, it's less than
that. It's like 20 minutes every two or three hours when the agents wake up. I
think I'm leaning somewhere very similar to Ankur. I also don't have a lot of
time for the synchronous programming and finding the one to two hours of focus
that it takes to get some of these bigger chunks done. So it will be a lot of
terminal agents. I'm starting to use the background agents a lot more to do
various small things. Like yesterday, I was putting up another case study, and
it's just, "Okay, here's the Markdown file," and then I use this program called
WhisperFlow so I can just yap for a minute about exactly what needs to get done.
Generally, the agents are very good at that when you can just speak instead of
typing, which is a bit of a bottleneck for me. Then I think there was about a
six-month period where you had to use the coding IDEs, and if you didn't, that
was a massive performance loss. During that time, I started having wrist pain
again, which was why I switched to NeoVim in the first place. So I'm back to
reading and reviewing all of my code in Vim and then having a lot of the
background agents and the terminal agents doing the work. The bottleneck now
very much is reading the code and coming up with good instructions, and that's
what I've been doing for the past many years, so that setup is very well honed. For that, I have a script called review where you just pass it a GitHub URL, and it gets the whole diff into Vim, in buffers, with the changed hunks in the gutter, and that's been very helpful. Now that you can just...
Ankur Goyal [50:02]:
Can you send me the script?
Simon Eskildsen [50:03]:
The agent? Yeah, I can send you the script. It's open source; it's in my dotfiles. It's hand-coded; it's from pre-LLM times, so you know...
Ankur Goyal [50:13]:
I use NeoVim as well. Actually, I went through exactly the same thing, although
I started using an IDE when Figma acquired us just to learn about Copilot. The
last act of my IDE was actually writing my new NeoVim config file, which felt
very nice. It was like a nice piece of closure. There are actually a bunch of
cool things in IDEs now. The LSPs have gotten a lot better, and you can recreate
a lot of the IDE features in NeoVim, I think, quite nicely now.
Simon Eskildsen [50:45]:
Yeah, and there are some distributions that make that really easy. Turbogrep,
just to sell my little open-source project here, is very helpful for reading
code in unfamiliar code bases, which I also spend a lot of time doing—like
reading dependencies, things like that—which takes a while to map the vocabulary
that you have in your head of how the thing works to how it actually works.
Talia Goldberg [51:03]:
That's very cool. All right, team. I think we're at the hour. Thank you guys so
much for spending the time. This was great for everyone on the call and
following along online. We'll follow up with ways to contact and learn more
about Braintrust and turbopuffer. Check out their sites; they're also hiring.
Enjoy! But thank you guys so much. This was awesome.
Ankur Goyal [51:29]:
Thank you for having us.
Simon Eskildsen [51:31]:
All right. Thanks a lot.