Sualeh Asif [0:12]:
So you scaled a lot from 2010 through 2020. You know, what were the great sevs
in Shopify history?
Simon Eskildsen [0:18]:
One of the funniest ones was we had this problem where about every hour, the
primary or the writer of our MySQL clusters would stall for about 30 seconds and
we couldn't figure out why. We were just debugging this endlessly and could not
figure out what was going on. Someone figured out when this was going on, there
was an lsof running on these machines. I was like, OK, why is this lsof
running? And someone was, you know, tracing the kernel, figuring out, OK, this
is causing a soft lockup in the kernel. Where is this lsof coming from? It turns
out that some of the Percona utilities, which are some of the Perl scripts used
to manage MySQL, drew in PHP as a dependency. And PHP as a dependency has some
standard cron job that every hour goes and does an lsof to figure out which
files, which sessions are open, and then removes the files for sessions that are
no longer actively open by the PHP process. And this was running every hour on
all of our MySQL instances.
Sualeh Asif [1:23]:
What were some of the great systems that Shopify was running, which was at a
scale that no one had run before?
Simon Eskildsen [1:30]:
I think Facebook had probably taken MySQL to really, really high heights before
Shopify, but MySQL was certainly one of the systems that we were all — a lot of
companies were just really compounding on the MySQL clusters through the 2010s.
GitHub was in a similar situation. And so I think we just spent a lot of time
scaling all the layers on top of MySQL. One of the big things was not exactly
novel: a lot of the SaaS apps in the 2010s had been written in something
like Ruby or Python or whatever, where you just had so many processes that
couldn't do that many QPS per process, maybe in the tens, hundreds if you were
lucky. And all of those processes had to have an individual connection to MySQL
or Postgres. Postgres is actually worse at this out of the box. And so with
MySQL, we had like 30,000 to 40,000 connections open, and so you're just
spending so much time polling through all these connections to figure out which ones you
can operate on at any point in time. So we had this problem with Memcached, we
had this problem with Redis, we had this problem with MySQL. Today there's lots
of open source proxies in front of all of these systems, but at the time there
wasn't really, so we had to play a lot of tricks to reduce this and just reduce
the connection counts as much as possible.
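A minimal sketch of the mitigation that eventually became standard, using Go's database/sql pool (the driver and DSN below are assumptions, not Shopify's setup): cap and reuse a handful of connections per process instead of letting every worker hold its own.

```go
// Sketch only: bound the connection count inside one process. The Ruby-era
// problem described above is the opposite extreme, where thousands of
// processes each pin their own connection.
package main

import (
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql" // driver import; the DSN below is hypothetical
)

func openPool() (*sql.DB, error) {
	db, err := sql.Open("mysql", "app:secret@tcp(db.internal:3306)/shop")
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(16)             // hard cap on connections from this process
	db.SetMaxIdleConns(8)              // keep a few warm for reuse
	db.SetConnMaxIdleTime(time.Minute) // release idle connections back
	return db, nil
}

func main() {
	db, err := openPool()
	if err != nil {
		panic(err)
	}
	defer db.Close()
}
```

Open source proxies in front of MySQL and Redis apply the same idea one level up, multiplexing thousands of application connections onto a few backend connections.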
Sualeh Asif [2:44]:
So how is the infrastructure team constructed so that you could handle like sev
after sev after sev? And as you're scaling through this, what were the really
interesting people who, you know, gelled together to make all of this possible?
Simon Eskildsen [2:58]:
When I joined in 2013, it was still a very traditional structure where it was a
pure ops team, right? People who were just incredibly good at operating Linux
systems and SSHing into all of them, like sort of the, you know, servers as
cattle era. The sevs that we had were not so much just the systems falling over.
A lot of the sevs we had really came out of large flash sales driving enormous
amounts of traffic into Shopify at one point in time.
Sualeh Asif [3:25]:
What's the worst flash sale ever?
Simon Eskildsen [3:27]:
I might not have been around for the one that overwhelmed the system the most,
but I remember Kylie Jenner's flash sales as particularly challenging. I think
she drove a lot of traffic even on trial accounts, like just showing up.
Sualeh Asif [3:38]:
So walk me through a normal flash sale. When there's a flash sale, how does it
happen? What happens when a flash sale happens in the systems?
Simon Eskildsen [3:44]:
So a flash sale is generally someone with a very large following, and then
through the 2010s we saw this — they might have like 10 million followers, 100
million, I don't know what's a lot of followers on Instagram, 10 million, tens
of millions, and they have some new product they release. And so you drop the
product and suddenly, you know, millions of people, or like hundreds of
thousands of people, trying to buy the same SKU at the same time. And so that
turns into an enormous amount of inventory lock contention on MySQL on that
inventory row, and that was the kind of thing that drove a lot of outages. So we
had to do a lot of things to make sure that both the inventory reservation and
everything would work, because often you have hundreds of thousands of people
fighting for maybe 10,000 SKUs. That was pretty much all the sevs, right? And so
they would drive these sales and then they would just keep dropping things,
right? Like Kanye would have a new sneaker, put it on Shopify, and then it would
drive an enormous amount of traffic.
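A rough sketch of where that contention lives, with a hypothetical schema: every checkout transaction row-locks the same inventory row, so concurrent buyers of the hot SKU queue up behind one lock.

```go
// Illustrative only; table and column names are made up.
package checkout

import (
	"context"
	"database/sql"
	"errors"
)

// reserveInventory decrements stock inside a transaction. The SELECT ... FOR
// UPDATE takes a row lock, so every concurrent checkout for the same SKU
// queues up right here; that queue is the flash-sale bottleneck.
func reserveInventory(ctx context.Context, tx *sql.Tx, skuID int64, qty int) error {
	var available int
	err := tx.QueryRowContext(ctx,
		"SELECT available FROM inventory WHERE sku_id = ? FOR UPDATE", skuID,
	).Scan(&available)
	if err != nil {
		return err
	}
	if available < qty {
		return errors.New("sold out")
	}
	_, err = tx.ExecContext(ctx,
		"UPDATE inventory SET available = available - ? WHERE sku_id = ?", qty, skuID)
	return err
}
```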
Sualeh Asif [4:43]:
So how are you guys preparing for this? You knew that there would be flash
sales, you knew that the flash sales are getting bigger and bigger. How is the
planning happening? How is the team getting ready for the next big horrible
event? And the horrible event can happen at any point. It's not like, you know,
for Stripe or something, there's like Cyber Monday or Black Friday, you know
it's going to happen on Black Friday, but these were happening at random points.
Simon Eskildsen [5:04]:
Exactly. So these were basically random events that would take us from, let's say,
a thousand requests a second to a hundred thousand, right? So massive scaling
events, and you're right, they could come at any moment. Of course, we also had
to scale for Black Friday and Cyber Monday, but it was a little bit more
predictable. The flash sales were sort of random tests and we didn't know —
sometimes they came from a trial account. So the preparation was writing load
testing tools. We had load testing tools internally trying to mimic what the
users were doing. They were not particularly sophisticated; they were just
little Ruby scripts that we ran on a lot of servers trying to reserve inventory
and contend on it, and then we would try to figure out what the bottlenecks
were. That was the majority of the preparation we were doing. But this was also
— the first flash sales that I was a part of were while we were still in data
centers, so we had a team that was racking servers ourselves. It was before the
cloud, and so it was very difficult to manage how many servers we needed.
It's still difficult and I'm sure you're also now running into this at your
scale because you actually have to go to the clouds. It's not infinite and you
have to tell them how much you're expecting to use. And with GPUs, it's probably
even more finite than in the CPU realm that I'm in now and that we were in then.
But we were just trying to pad with as much capacity as possible. So, I mean, at
the end of the day, you don't have that many choices when you have to scale very
quickly. You can scale up. It's too slow. And on-prem, you can't really scale up
much. The second thing is to cache harder. So larger TTLs and trying to move the
caching up. In the beginning of Shopify, we were doing a lot of the caching in
Ruby, but over time we moved some of that caching into Nginx directly and
served it from Nginx Lua to try to just move more and more load off the
servers. We would also do things like shedding load. So if someone had a massive
flash sale, we would try to prioritize requests, like if you had a cart or you
had a session, you would prioritize that request over someone just coming to the
site. So load shedding was another mechanism that we would do. You don't have
that many other options other than that, and load shedding is just a way of
trying to fail gracefully and failing fairly. Those were really the main levers
that we had.
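A minimal Go sketch of the load-shedding idea (the cookie names and the capacity number are hypothetical): when the server is saturated, anonymous page views get rejected while requests that already carry a cart or checkout session are allowed to wait for a slot.

```go
package main

import "net/http"

// shedLoad caps concurrent requests with a semaphore. When the semaphore is
// full, requests with a cart or checkout-session cookie block for a slot,
// while anonymous page views are rejected immediately with a 503.
func shedLoad(next http.Handler, slots chan struct{}) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, cartErr := r.Cookie("cart")             // cookie names are illustrative
		_, sessErr := r.Cookie("checkout_session") // an error just means "no cookie"
		priority := cartErr == nil || sessErr == nil

		select {
		case slots <- struct{}{}: // capacity available
		default:
			if !priority {
				http.Error(w, "busy, try again", http.StatusServiceUnavailable)
				return
			}
			slots <- struct{}{} // priority traffic waits for a slot instead
		}
		defer func() { <-slots }()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("storefront"))
	})
	http.ListenAndServe(":8080", shedLoad(mux, make(chan struct{}, 512)))
}
```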
Sualeh Asif [7:11]:
How was the Shopify infrastructure organization organized? How did it — how was
it in 2013? How does it go from like tens of people to hundreds of people or
thousands of people now?
Simon Eskildsen [7:22]:
There were probably five to ten people in 2013, at like maybe a few thousand
requests per second, and then we had a team of five to ten people who were
engineers without an ops background. This is like when people just started
talking about DevOps, right? Where DevOps was, oh, maybe someone can both SSH in
and run the Linux commands and also write software. That was almost a novel idea
back in 2013. And so that started to happen at Shopify. You had a bunch of
people who were just doing performance engineering stuff in the application and
a bunch of people doing servers, and those teams ended up merging. Back then it
was really scary. It's like, well, you're going to have a developer writing Chef
to configure the infrastructure, and that's what we were figuring out then. So
those teams merged and we ended up calling that the production engineering team.
I think Facebook did a good job pioneering this pattern and then it just started
breaking into different teams.
Sualeh Asif [8:13]:
So this is actually really interesting. One of the things that I've learned is
that actually it wasn't just, you know, there was a bunch of great companies
scaling in 2010 — there's, you know, early Stripe and early GitHub and early
Shopify and so on. There's like tons of different companies and they all
collaborated with each other. How are the teams, like different teams, you know,
infrastructure teams collaborating across the organizations to build tools? What
are the great tools that came out of it? What are the great open source
libraries that people use now that have stories in the infra wars of the early
2010s?
Simon Eskildsen [8:45]:
I was just talking to — we probably both know Sam Lambert who runs PlanetScale.
We were talking about this, how in the 2010s a lot of these lessons, and it's
probably still the case, are not written; they're shared on phone calls. Like
you and I had some of those phone calls in the early days of Cursor, like how do
you do this and all of that. And it's a lot of wisdom that honestly the models
can't really train on because most of this is just in a bunch of people's heads.
And in the 2010s there was a bit of a collaborative intelligence between these
companies.
Sualeh Asif [9:13]:
Walk me through some of the people you talked to.
Simon Eskildsen [9:16]:
So we talked to Zendesk, who were also scaling Ruby and MySQL. Intercom was also
using a bunch of the libraries that we built for caching and also Rails and
MySQL. GitHub, of course, GitHub built things like this open source library
called gh-ost, which is something that uses the MySQL bin log to run migrations
of the schema. SoundCloud built this thing called LHM, which was a way to use
triggers in MySQL to do schema migrations. At Shopify, we wrote a project called
Toxiproxy. I don't know, have you ever heard of this project? Yeah. It was a
project I started because everything just started failing with all these
different services that we had and we needed a proxy where we could guarantee
that if we took down all these different services that things wouldn't fail. So
for example, we had a database that was managing all the sessions and carts and
we needed to ensure that if that database failed, all of Shopify stayed up so we
could write tests against it. I think a bunch of those companies used that. I
think Sam and I, like at GitHub and Shopify, we were on the phone together and
we're like, oh, what are you doing about this with Ruby and how are you scaling
MySQL? And is this proxy good? And what are you doing about connections? And,
you know, we would send patch files together to get Redis to scale better. And
there was just a lot of this collective wisdom, a lot of these Ruby on Rails
MySQL shops in the 2010s.
Sualeh Asif [10:35]:
What's the story of Logrus? Logrus is probably your most popular library?
Simon Eskildsen [10:43]:
Logrus is a Go library, and I created it because I was in so many incidents
where I was just so mad at myself from six weeks ago for not more intentionally
thinking about what output I wanted to see from the system at a particular point
in time. And so I wanted an API that just forced me to sit down and think about
what information I want to dump out. So the Logrus API is like
logger.WithFields and then you have to type them all out. Now that's really
the only thing that's well designed about Logrus; it does way too many
allocations and I haven't had time to actively maintain it very much, but it
took off.
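The WithFields API he describes looks roughly like this; the field names are made up, and the import path shown is the current sirupsen/logrus one.

```go
package main

import log "github.com/sirupsen/logrus"

func main() {
	log.SetFormatter(&log.JSONFormatter{}) // structured, machine-parseable output
	// WithFields forces you to decide up front what you will want to see
	// during an incident; these particular fields are just examples.
	log.WithFields(log.Fields{
		"shop_id":  42,
		"order_id": "ord_123",
		"queue":    "checkout",
	}).Info("payment captured")
}
```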
Sualeh Asif [11:22]:
What does it look like when a library takes off?
Simon Eskildsen [11:25]:
I think it has like 25,000 stars, and people just kept telling me,
oh, we're using Logrus, I really like it. And so people got the idea of just
structured logging — this was before OpenTelemetry and all of that, so I think
it just clicked for people.
Sualeh Asif [11:43]:
So one very interesting part that you mentioned in the Logrus story is, you
know, you want to write things intentionally, be very careful, be thoughtful,
like in advance when you'll actually need to be thoughtful. What are other great
engineering principles? What are the Simon engineering principles that one
writes in an AGENTS.md?
Simon Eskildsen [12:04]:
The way I think about software in my head is that over time, the software has to
age well. And over time, the software is under strain from time, patterns
changing, the language changing, scale, and lots of people working on the same
thing. The only thing that I just keep coming back to, which is a major
inspiration also for turbopuffer, is just to make it as simple as possible. I
worked at Shopify for almost a decade, and it's rare these days to have worked
in one code base, one company for that long. And what it taught me was how
software ages, because I saw so many projects where people would spend a long
time writing an RFC and doing a big project on something, and that was not a
predictor of that software aging well. And then there were times where, on the
infrastructure team, we just had to hold something together with spit and
bubblegum, as my boss used to say, and it would age phenomenally. Like that spit
and bubblegum would be perfectly in place and holding water five years later.
And so I think the biggest thing that I just learned is just to keep that
simplicity — simple things surprise you, and complexity has to be deserved. That's the
underlying principle, I think a lot of things follow from that. I spend a lot of
time thinking about how different things are gonna fail if you 10x or 100x the
scale. Like if you're doing any infrastructure change, it has to be able to
scale 100x, otherwise there's no reason to make it. You're just playing musical
chairs. But the aging well and letting simplicity start and complexity be
deserved is extremely important and it's fundamental to how we've designed
turbopuffer. The other thing is that being on call on the last resort pager of a
piece of software where real people around the world are losing millions of
dollars per minute of downtime is an enormous responsibility. We were maybe six
to eight people on that pager. And if you got paged, you knew it was on you to
bring the site back up. And I think that taught me how to write software in a
way that — it changes you for better or worse, right? And so with turbopuffer,
for example, it was just like, what's the piece of software that I'm willing to
go on call for? And it has to be extremely simple and very easy to debug. And
back to Logrus, right? It's how do we make this as easy to debug as possible?
Because every line of code that I write is a liability for someone to be on call
for at 3 a.m.
Sualeh Asif [14:35]:
Well, this is super interesting. I guess one way to frame the question is, you
know, as you are RL-ing the models, you kind of are writing a constitution that
the models have to follow, because not only are they trying to, you know, get
better and better at passing a certain suite of tests, but also we want them to
write good quality code, you know, in the same way that humans have this
constitution that, you know, it's very simple, it's not very complicated, but
it's very widely applicable. Is there a software engineering constitution that
we should be using to train all the models such that the code that comes out is
good quality, stands the test of time? I think a lot of people generally are
worried about slop that the models produce. We kind of want to train them to not
produce slop. What's the constitution so they don't produce slop?
Simon Eskildsen [15:22]:
When I talk to the models about designing software, it feels like they want to
design systems like an eager undergrad who's read way too much Hacker News. I
would ask you how you would RL a model to encode principles like that of just
simplicity over everything, because it's a certain set of trade-offs, right?
Like in turbopuffer, we keep everything on object stores, there's no state
anywhere else. And it's like, yeah, if you want low latency for writes, you
gotta look somewhere else because we're not going to give you that because I'm
not willing to accept the trade-offs. I don't know how good the models are at
navigating trade-offs like that. It feels like they're very eager to design a
very perfect system and not that eager to try to design something that's very
simple and will age well under those kinds of pressures. But how can you RL
that?
Sualeh Asif [16:07]:
I think one can come up with various ideas, right? By default, you're trying to
— even just the thing as like, write something as simple as possible, and when
you're judging between two correct solutions, prefer the solution that's really
simple. And, you know, another one that you could try to encode for is make sure
it's short. You know, if you enforce the model to produce short things, minus
the code golfing aspect, shorter things are generally simpler. Don't write
something in a hundred lines that you can write in ten.
Simon Eskildsen [16:37]:
But I don't think it's always about the lines of code, right? Like, something
such as turbopuffer might accept all of its writes into Kafka or something like
that. That might be fewer lines of code, but now you're operating this whole
other system, and the system complexity is a lot higher. Do you give the models
a constitution?
Sualeh Asif [16:54]:
Yeah, we actually write down the things that we think of as great, great code.
And that's important because if you don't write such a thing, the models will
write all sorts of crazy stuff that is unbelievable.
Simon Eskildsen [17:05]:
Yeah, I think also just thinking about all the graceful failure modes, right?
And just the recovery of the software is also something that we've always tried
to preserve — the property that you can shut down every single server and you
lose nothing, like there's no lost data. And it's a surprisingly difficult
invariant to uphold. And so I think the other thing I think about with software
aging is just what are the invariants in the system that I care about, right?
Like we've both done competitive programming, right? And you're always thinking
about what are the invariants in the system that have to hold under every
condition, because otherwise you know that it's going to fail some test at some
point.
Sualeh Asif [17:38]:
You write a famous blog. Famous-ish blog.
Simon Eskildsen [17:42]:
A stale famous-ish blog maybe.
Sualeh Asif [17:44]:
What's the story of starting a blog? It was very, you know, inspirational for me
when I read it back in the day where it's like — for people who don't know, the
blog basically walked through, you know, for seemingly complex engineering
problems, how can you break it down into Fermi estimates and actually get a, you
know, fairly reliable and good estimate of how the system will perform under
weird, weird shit that is otherwise really hard to estimate.
Simon Eskildsen [18:10]:
I had no idea that you read it. Did you read it before we got to know each
other? So what you're talking about is napkin math.
Sualeh Asif [18:16]:
A bit, yeah, a bit before.
Simon Eskildsen [18:18]:
Napkin math came out of my role at Shopify where I was a principal engineer. And
one of the things you do as a principal engineer is you're reviewing a lot of
the science. So you're reviewing the science of "I want to build this product,
it's going to use the database in this way." And something that I found myself
repeating a lot — people would come to me with a benchmark and say, this is how
I expect this database to perform. Like I still really don't like benchmarks
very much, especially not when you're trying to make a technical argument,
because benchmarks are like a point in time. They don't tell me anything about
what the fundamental properties of the system are. And so the lesson that I felt
like I was preaching in so many of these reviews talking about like, okay, this
might be the benchmark, but the fundamental properties of the system are that,
you know, if you're trying to do a search query, for example, you can just do a
little bit of math and figure out how many gigabytes of data that you have to
move to serve the query. You see, okay, to service this query, I have to move
around a gigabyte of memory; DRAM can maybe do about 10 to 100 gigabytes per
second, so this should take somewhere between like 10 and 100 milliseconds. And
you would come back with a benchmark and they're like, well, it takes five
seconds, so we can't use this database. And it's just an unacceptable
explanation to me because there's a gap here between my first-principle
understanding of the system and my dumb high school multiplication math and how
the system is performing. So that gap between the first-principle understanding
of the system and how the system is actually performing in the benchmark — we
have to close that gap before we can conclude anything. And that gap is either,
like, my stupidity, like this math is wrong, or it's that this system is not
performing, but like one of us is wrong. And unless someone could close that
gap, it was just like, you're not making a compelling argument. The benchmark is
not persuasive unless you can explain that the benchmark only tells you how
close you are to the theoretical floor. And so I just started a blog to try to
explain this and to give myself a bit of a cadence of like, okay, you know,
you're trying to do this join, how long might that take? And then just doing a
bunch of math, running the actual test and seeing what the difference is, and
then trying to reconcile that gap.
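The napkin math from the anecdote as a tiny runnable sketch; the 1 GB figure and the 10 to 100 GB/s DRAM range are the rough numbers quoted above, not measurements.

```go
package main

import "fmt"

func main() {
	const bytesToScan = 1e9 // roughly 1 GB touched by the query
	const slowDRAM = 10e9   // 10 GB/s
	const fastDRAM = 100e9  // 100 GB/s

	fmt.Printf("latency floor: %.0f to %.0f ms\n",
		bytesToScan/fastDRAM*1e3, // best case: 10 ms
		bytesToScan/slowDRAM*1e3) // worst case: 100 ms
	// A benchmark that comes back at 5 seconds is 50 to 500 times above this
	// floor: either the math is wrong or the system is doing work it does not
	// have to, and that gap has to be explained before the benchmark means anything.
}
```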
Sualeh Asif [20:39]:
What's your favorite blog post in the series of blog posts you wrote?
Simon Eskildsen [20:43]:
I really like the one about TCP windows. Have you read that one?
Sualeh Asif [20:48]:
It eats away at every single, you know, scaling system.
Simon Eskildsen [20:52]:
It does. And it became very important to winning a very important deal at
turbopuffer, actually, at some point — this blog post. So this blog post is
basically — this was a prompt that came to me at Shopify where someone was like,
okay, why does it take three seconds to load a page in Australia, right? So
you're going from Australia to US East. I'm guessing that round trip is probably
250 milliseconds — there's like a pretty good undersea cable probably over the
Pacific, and then you're adding the 60 milliseconds cross-continent, so probably
around 200 to 250 milliseconds total, right? But the page load was taking three
seconds on like a vanilla Shopify store. You just spin it up and like I know
that the Ruby and stuff is taking like 10 milliseconds. Makes no sense. And this
just haunted me then. And then I went to visit the site and I would refresh and
it would take less time, but it was not a cache hit. I would skip the caches and
then it would take 260 milliseconds. And just like, what is going on here,
right? There's like three seconds versus 260 milliseconds in my understanding.
Like what is this gap? Is it my stupidity or something like, not working here?
And so I dug into it and spent a bunch of time in Wireshark looking at it. I'm
like, why is it going back and forth over the Pacific so many times? And it
turns out that in TCP, what you do is when you open a connection — so, you know,
Sydney dials to US East — US East will only the first time send 10 packets back
because they're trying to negotiate how big the link is between the two and it
has a very conservative default. So it'll send 10 packets. The packets are like
1500 bytes each. And so a website that's like 15 kilobytes is going to load
faster on most machines than one that's 16 kilobytes. And I just sort of deduced
this from Wireshark. So this website is in like hundreds of kilobytes, so it
does 15 kilobytes and then TCP says, oh, okay, I guess the link is big enough
for 15 kilobytes. Let's try 30 kilobytes the next time. So now you transfer 45
and it keeps doubling. But now you're doing a lot of round trips back and forth
to negotiate the size of that link. And so what I realized is, okay, well, if
you just tune the Linux kernel's TCP settings on both ends to send 100 packets
into the first round trip, this will go a lot faster. The downside is if you
have a very, very bad uplink, you're going to see a lot of packet loss, but in
general, it's probably a better setting. And this became applicable even at
something like turbopuffer at some point.
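A back-of-envelope sketch of the slow-start effect he describes; the 1500-byte packets, the 300 kB page, and the 10 versus 100 initial windows are illustrative numbers, not the exact ones from the incident.

```go
package main

import "fmt"

// roundTrips estimates how many round trips TCP slow start needs to deliver
// pageBytes, starting from an initial window of initPackets packets.
func roundTrips(pageBytes, initPackets int) int {
	const mss = 1500 // rough bytes per packet
	window := initPackets * mss
	sent, trips := 0, 0
	for sent < pageBytes {
		sent += window
		window *= 2 // slow start roughly doubles the window each round trip
		trips++
	}
	return trips
}

func main() {
	page := 300 * 1000 // a roughly 300 kB storefront page, illustrative
	fmt.Println("round trips with an initial window of 10 packets: ", roundTrips(page, 10))  // prints 5
	fmt.Println("round trips with an initial window of 100 packets:", roundTrips(page, 100)) // prints 2
	// At roughly 250 ms per Sydney to US East round trip, that difference
	// accounts for most of the 3 s versus 260 ms gap in the story.
}
```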
Sualeh Asif [23:05]:
So that brings us to one of my other favorite questions. So there's a two-part
database question. So number one, what are the great databases of the past? You
know, what are the systems you found very inspirational? What have you learned
from them? Walk me through your understanding of the evolution of databases.
Simon Eskildsen [23:25]:
So I think there's a couple of angles to this. The way that I think
about it is that about every 15 years, the ingredients are in the air to build a
new database. A lot of databases come around, right? I'm sure there's thousands
of databases in production out there, but in terms of big databases, it sort
of begs for a new platform, right? Like Oracle, for example, and MySQL and
Postgres were built in the 90s. We have the web, and we have all these
SaaS companies being built and lots of data going into databases. That was the
first, I would say, wave of big production databases. There were some in the
80s, but the 80s and 70s are so far before I was born that I don't have a
great understanding there, but obviously there were databases then, often on
mainframes and so on. But the 90s was when the first generation of big database
companies were built. Then about 15 years later, you have a new workload, not
just, you know, websites and so on trying to store data and applications
starting to store data, but you had these large-scale OLAP workloads. And so
then you have Snowflake and Databricks and a bunch of other companies being
built around this new workload. I think big database companies basically are
companies where every single company on earth has data in that database either
directly or indirectly. And so, you know, while the proverbial textile
manufacturer in Bavaria, Germany is not going to go out and buy Snowflake
directly, almost certainly they're using some product that's using Snowflake or
Databricks, right? Or tens of times. So their data is probably in that database
tens of times. And those are how the biggest database companies are built. And
the biggest database companies — I think now there's a moment where that's
happening again, where there's all these AI workloads that people are trying to
connect to data. But that's one way, I think, to see the past. In terms
of inspirational databases, SQLite is the first one that comes to mind for me.
What I really like about SQLite is they just have this hardcore minimalist
philosophy in everything that they do. I think the best example, and this
would go in my software AGENTS.md constitution, is to try to get as much
pressure on every code path as possible rather than having separate ones. In
SQLite, I have a phenomenal example of this. And it's that normally when you do
a join ad hoc, you will do some version of like a nested for loop or a hash join
or something like that, and you implement that as a particular path. SQLite
basically will construct an ad hoc B-tree in memory to perform the join, which
is the same code path as far as I understand of the B-tree index they build when
you actually build the index on disk. So they're putting more pressure through
that single code path, so more optimization there benefits more and more query plans.
I think that's brilliant. Obviously, another database that I admire is like
Google Cloud Storage and S3 and these blob stores, where they have very, very
few APIs, but those APIs have a very, very consistent histogram of latency,
almost infinite scale, and they work and they're extremely reliable. I really
like systems that have very few primitives and you just know they work and you
know that they will honor their histogram bounds even when subjected to a lot of
time, so reliability, but also to a lot of scale. So I've certainly taken a
lot of inspiration from that as well. For me, there's also a bit of nostalgia
even just in the old days of just ripping it with FTP and phpMyAdmin and a MySQL
box. I really like that and there's a sense of that that I would love to bring
into the database that we're building today.
Sualeh Asif [27:05]:
So next, there's part two of my two-part database question. One of the things I
don't understand very well about the database industry — this might sound naive
— but most systems that you see in the world have this property that as the
industry matures, there's one winner, and that one winner just stays the winner
forever. So just to walk through a few examples: operating systems, there's a
lot of competition, then there's Linux wins in some ways, or for consumers,
macOS wins, and that just stays the way forever. And, you know, for
virtualization, there's one winner and that winner stays the winning company
forever. And like almost all systems have this property, you know, it's true for
OSes, it's true for anything you use in terms of APIs in the real world. Even in
clouds, there's like basically AWS plus two copies and those are like the
standards forever. There's no real new contender popping up every 15 years and
disrupting the standard. Databases are the only example where, every 10 years,
there's like a new crop of companies, like someone starts a new database in the
hopes of, you know, becoming the next big database that everyone will use. And
there seems to be this consistent innovation over a period of, you know, half a
century, which it doesn't feel like anything else has. You know, there are no
great infrastructure companies where once the infrastructure company starts
winning, it stays stable. There's like a new one in the category every five
years. People joke to death that the only other answer to this is JavaScript
frameworks, but outside of JavaScript frameworks and databases, why do databases
have such a property?
Simon Eskildsen [28:52]:
So the mental model I have of this is that to serve new infrastructure software,
there needs to be a new workload. And so the workload that we expect of an
operating system hasn't changed enough for us to require a rewrite, right? We
expect now virtualization and really good isolation primitives from the
operating system. Solaris was a lot sooner to that than Linux, right? Like Linux
only really got good at that even just a few years ago, right? And they were
working on that through the 2010s, but it wasn't enough to disrupt the whole
thing. And so I think with databases, we do see new workloads like every 10 to
15 years, but there's not as many new workloads as I think people think that
really matter. Now, there's a lot of niche workloads, right? So there's a niche
workload in graphs. It's a fairly niche workload, but the big database companies
are built when you start with a seemingly niche workload and then expand from
there. I think MongoDB is a great example of that, right? They started very
web-focused, like get up and running very quickly, and then they've just expanded — they
do time series, they do graph, they do everything. So I think that's one
property, the workload, but workload is not enough.
Sualeh Asif [30:01]:
Why can't the incumbent also get really good at that workload? Why can't
Oracle just keep getting better and there's never a new database company, just
Oracle v1, v2, v3, v4 — it seems like they get disrupted again and again and
again.
Simon Eskildsen [30:05]:
Yeah, so that brings me to point number two. So you need a new workload because
there needs to be some reason for every company on earth to have some data in
this new database. Otherwise, it's not going to get that big. The second thing
that you need is a fundamentally new storage architecture that is
advantageous for that workload that the incumbent can't get to. So an example of
that, right, is like Snowflake and Databricks are architected with commodity
hardware, you know, directly on object storage. And Oracle can't do
that, right? It does not have a separation of compute and storage, so it can't
do these massive OLAP workloads. Even though you could ostensibly do that as an
extension of the design, Oracle has at that point what, 30 years of heritage of
assuming a tight coupling, a non-separation, of compute and storage. Now, right, we can
build databases that take advantage of NVMe SSDs, and Snowflake and Databricks
didn't do that because NVMe SSDs weren't even in the cloud until like eight
years ago. And NVMe SSDs require you to architect the database fundamentally
differently to take advantage of it. You need a lot of outstanding IOPS in very
few round trips. And that's also what you need to do to do very low latency on
object storage. The other thing is metadata. So the first generation of
databases built on object storage would have had to put all the metadata in a
separate database, which has all kinds of other problems like running on other
people's clouds and running in many regions. And you can build databases only as
of like a year and a half ago that can have the metadata also in object storage.
So when you have those two conditions of a new workload — in the 90s, it was,
you know, computers and the internet, and 15 years ago it was OLAP and analytics. And
today it's connecting very large amounts of especially unstructured data to AI.
That's the new workload you need. The second thing you need is a new storage
architecture. It can't be copied, right? The search engines before turbopuffer
can't copy it because they tightly coupled compute and storage, and Oracle
couldn't easily copy what Snowflake and Databricks did, which was separate
compute and storage.
Sualeh Asif [32:21]:
What's the story of Simon becoming a fan of databases? You're clearly very
passionate about databases.
Simon Eskildsen [32:28]:
I love databases. I think competitive programming has a lot of just
database-adjacent topics and so you just start thinking in asymptotic notation
and how things are executed. And when I got to Shopify, I was just so drawn to
the people who were working on databases. And the thing that breaks when you
scale a website is generally the database, so it's just always the thing that
was breaking. And so we had to stay ahead of it not breaking tonight at the
flash sale and not breaking a year from now when the flash sale would be 10
times as large. So it just becomes a fundamental bottleneck for scaling and that
just drew me to it — like why these things are hard to scale. And so we just
have to keep working on it and working on it and working on it. And at some
point my model just started shifting from thinking about it as a thing that
executes SQL to just thinking about how the bits and bytes are laid out on disk.
They have so many fascinating trade-offs, right? Like we just talked about
storage architecture being different for different query workloads. There's so
many trade-offs in databases. And I love thinking in trade-offs. Like I love
thinking, okay, well, if we do this, it's going to be better at this, but worse
at this. Especially in search, you're applying so many parts of
computer science and computer engineering to one particular domain that
there's just an infinite amount of fascinating problems.
Sualeh Asif [33:44]:
I remember the first time we met, we were — at Cursor we were running into some
Postgres bottlenecks. And one of the most fascinating things at the time was you
laid out, okay, here's how Postgres is architected. I can like make a simple
mental model of like things in Postgres happening this way and things in MySQL
happening this other way. And at the time, even though I had read about Postgres
and understood some of the architecture, I hadn't mapped it down to all the
little blocks and how they interact with each other. How'd you learn that? How does
one learn about the weird intricacies of, here's how Postgres is designed in all
of its little bits and components and all the query parameters that one can find
around for those things?
Simon Eskildsen [34:25]:
The short answer is that I can't help myself. I don't think I'm particularly
smart, so I just spend a lot of time trying to dumb it down to something that I
can understand and explain to other people. And I spend a lot of time on a
little notebook just trying to draw out exactly how the blocks move
around. And so I think it took me about 10 years to get a very, very good
understanding of how MySQL works. And then when I left Shopify, I was helping
some of my friends' companies scale, and I kept running into Postgres and I was
like, okay, this is cool. Like, you know, I don't know anything about Postgres,
so I just sat down one day and I just spent eight hours reading the entire
manual with a lens of how is it similar and how is it different from MySQL. And
just always comparing and contrasting. So it was easier because I already had
sort of a trunk to land the knowledge on. And so the compromises became very
apparent, right? The way that the indexes work in Postgres is very different
from how they work in MySQL, right? In MySQL, the way that the data is laid out
on disk is dictated by the primary index. So at Shopify, we took advantage of
that by having all of the data for a shop located together. That's very
complicated in Postgres. Postgres handles writes in a very different way that
requires a lot of tuning for the user that MySQL doesn't. And I mean that was
the problem you were running into when we met.
Sualeh Asif [35:45]:
How do you code with AI? Or in general, how do you use AI? Where, is there any
way where models are helping you outside of coding?
Simon Eskildsen [35:53]:
For sure. So, on coding in particular, I just have a Cursor window with a
website open. A lot of what I do is like docs or smaller changes. And I just,
yeah, I just have Cursor running. I use a synchronous agent and then I just
choose a model based on what I need to do. If I just need to answer a question
about the code base, I use the Composer model because it's very fast and
searches a lot. And then I use different models depending on the kind of task
that I'm doing. But generally I'm managing a few agents inside of Cursor. And
it's especially inside Cursor because it makes it very easy for me
to review the code. I think my contrarian view is that we're still going to be
reviewing every single line of code that goes into the database by the end of
this year because that seems paramount still for the database. Like we have a
couple of individuals at turbopuffer whose job it is to just have the entire
context of the code base in their wetware neural network, and it still works
really well because they have their manifesto and AGENTS.md also embedded in
that wetware, and they make good local optimum decisions and the agents help
them.
Sualeh Asif [37:00]:
One of the things I've been excited about over the last many months is getting
cloud agents to be better and better and better. What would it take for 50% of
turbopuffer's code to be written by cloud agents? And I imagine one of the
bottlenecks of writing 50% of your code is you really want to be sure that it's
verified the thing, you know, all sorts of queasy conditions. How do you think
about testing in the age of models operating for 12, 24, 48 hours? And in
general, at what point will 50% of turbopuffer's code be entirely written by
cloud agents?
Simon Eskildsen [37:39]:
I think we probably should have agents running all the time trying to break the
products before we do or before the customers do, and use cloud agents. I'm sure
some of the engineers are already doing that. I mean, generally, when I use the
model still on core, like core, core Rust and database things, it's still not
making globally optimal decisions. So I feel like when I'm doing
something in the dashboard or on the website, it can get such lax instructions
and do incredibly well. But in the database, there are just so many other
properties of how is this API going to age, right? Like how is the data going to
be laid out? And there's all of these war lessons, right, that the model has not
learned about how to operate this. What has to be true? I think the model just
has to get better. I don't think I have any more wisdom on that. And I think
that I want the model to talk more about how something is going to age, and
what irreversible decisions this is going to make on the storage
architecture — like ostensibly irreversible decisions.
Sualeh Asif [38:44]:
So three turbopuffer questions. So first one, what would it take for turbopuffer
to become Google scale?
Simon Eskildsen [38:50]:
To index the entire web? It can already do that.
Sualeh Asif [38:54]:
Yes. Index the entire web and serve the QPS that, you know, Google would serve.
Like actually be usable by Google.
Simon Eskildsen [38:59]:
So, I mean, I don't really know — Google, I'm sure, is a web of like a thousand
services to do it, but we have customers that have indexed the entire web into
turbopuffer and it works and it can do thousands of QPS. Now, I'm sure Google
has their hands on more than the hundred billion or so documents that we've
procured in the data sets that have been indexed into turbopuffer, maybe in the
trillions, but it can certainly be done. There's no reason why that wouldn't
continue to scale. And we've done the hundred billion with p99 of 200
milliseconds and p50 of like 40 milliseconds, so that's quite possible. Now, to
get the relevance to where something like Google is, you might have to do a lot
more, and that would be quite a few iteration cycles. But I mean, you can index
the entire web with not that many servers with the turbopuffer architecture.
Sualeh Asif [39:43]:
Could it ever be built without S3? Is S3 sort of being super strongly consistent
just one of the great engineering feats of the last decade?
Simon Eskildsen [39:53]:
Yes, I think so. Back to the original question — the question about why there's
new database companies coming up every 10 years is that we needed NVMe SSDs,
which were not available in the clouds until like the late 2010s. We needed S3
to be consistent, which did not happen until December 2020, which is
mind-blowingly late. And then we needed compare-and-swap.
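A sketch of the compare-and-swap primitive he means, against a hypothetical object-store interface; real stores expose this as ETag or generation preconditions on writes, and the names below are invented for illustration rather than taken from any SDK.

```go
package cas

import "errors"

// ErrPreconditionFailed signals that the object changed between read and write.
var ErrPreconditionFailed = errors.New("object changed since read")

// ObjectStore is a stand-in for S3/GCS-style conditional writes; real stores
// expose the version as an ETag or generation number.
type ObjectStore interface {
	Get(key string) (data []byte, version string, err error)
	// PutIf writes only if the stored version still matches expectVersion.
	PutIf(key string, data []byte, expectVersion string) error
}

// UpdateManifest retries a read-modify-write until no concurrent writer has
// slipped in between the read and the conditional put.
func UpdateManifest(s ObjectStore, key string, modify func([]byte) []byte) error {
	for {
		data, version, err := s.Get(key)
		if err != nil {
			return err
		}
		err = s.PutIf(key, modify(data), version)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrPreconditionFailed) {
			return err
		}
		// Lost the race: another writer got there first, so reread and retry.
	}
}
```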
Sualeh Asif [40:13]:
Why is it so hard for S3 to be strongly consistent?
Simon Eskildsen [40:15]:
I don't know. In Google Cloud Storage, I think it's probably Spanner or
something sitting in front. And so that's a little bit easier for me to
understand. There are more or less no details on the S3 metadata layer. That
metadata layer is presumably very difficult to operate on. And I know that S3 —
yes, their API surface is small, but I know they invest a lot in formal
verification and all these different things, presumably because they have little
bugs that people rely on and they have to make sure that even when they change
the tiniest thing, that doesn't break for a lot of other people. And so I don't
know what makes it so difficult. I think probably the thing that makes it most
difficult is that the system existed for 15, maybe 20 years without having that.
And from the systems that I've seen, it's a very fundamental assumption in the
system and you're going to be engineering for 15 years assuming that this is not
true. Yeah, that would probably take — I'm sure that took them like five years
to do. I'm guessing Google Cloud Storage, because they were fronted by Spanner,
might have been consistent much earlier, if not from day one. But I also know
that they consider it one of the greatest mistakes of S3 that it wasn't
consistent from day one.
Sualeh Asif [41:22]:
What else do you like about S3? When I say S3, I guess I mean...
Simon Eskildsen [41:26]:
Big storage. Yeah. I think just the simplicity, right? Like it's tight latency
bounds on very few things. I think they're very, very predictable systems. It
automatically shards and it's infinite. Turbopuffer would not exist without it,
right? Like if this was like 15 years ago, we would have people full time just
racking HDDs and trying to strike deals to get as many HDDs as possible. And
we'd have the problem that you probably have with GPUs, but just with HDDs. I'm
very happy to not have that problem. I don't think turbopuffer would exist. And
I probably wouldn't have started the company because I would not have wanted to
be on call for a product that required racking HDDs and flying to Ashburn every
week to rack more HDDs.
Sualeh Asif [42:10]:
One of the really weird things I've had recently is it's getting harder and
harder to interview people, where one of the great interviewing tricks of the
last decade was instead of interviewing them on the blackboard, give them a
really complicated code base and let them try to do something in a complicated
code base because naturally doing things in complicated code bases requires an
incredible amount of RAM. Like you need to be able to hold a lot of things in
your head while still being able to produce net new stuff other than just
getting, you know, stuck or being asked to deliver, you know, day to day. If you
need to deliver in two days something that is required of you, you can just shut
everything else off and actually focus on the thing. Sadly, language models have
just gotten so good that you can't use that interviewing trick anymore. How do
you think we go back to first principles and interview great engineers? Because,
you know, there's obviously drawbacks of doing things like whiteboard interviews
because whiteboard interviews are not the thing you're doing on a day-to-day
basis — it isn't exactly the skill that you're testing.
Simon Eskildsen [43:15]:
This is something I spend a lot of time thinking about. I have a document called
Traits of the P99. I don't know if I've ever shared an early draft of this — you
and I have spent a lot of time talking about interviewing over the years. And
it's just a list of traits of the P99. And it's a long list and I can't share
all of them because that would be too easy. But I think that the P99 is someone
that you would describe as fast. It comes out in very, very different ways in a
lot of people, right? You and I can talk very fast. And so that can be
interpreted as fast. But some people are fast because they move very
deliberately and there are not bugs in their code, and they just keep moving
forward and it's just one step forward, one step forward, and it's never two
steps backwards. It comes out differently, but I've never met a P99 that I could
not in some way describe as fast. Another trait of the P99, I think, is that
they have bent something to their will, whether it's software or their
trajectory or something like that. They have just made the machine or something
do what — I mean in Silicon Valley you call this agency, but it's also agency or
facility with the machine itself to get it to do what you need to do. I think
P99s try to the best of their abilities to surround themselves with P99s, and
they probably have multiple moments in their life where they discover that there
was another level and they could not help themselves but to try to get there,
right? You're an immigrant to the new world like myself, right? And so you
probably at some point tapped out of the P99s in your local community and you
went looking somewhere else, right? And I think one of the interviews that we do
is an interview that we call the life story, and our recruiter Jen spends an
hour and a half with people just going through their life, like all the way from
when they first were introduced to something that they were excited about. And I
think that the P99 got very excited about something very early and they
continued to seek out the next nine, like P99 to P99, right? And for you, I'm
sure it was like going to MIT, you discovered another level, right? Being an
IMO, you discovered another level. And I think that's another trait to look
for. And so I just have a list of these — like they're obsessive, like they
can't help themselves but get in the weeds on some detail they find particularly
interesting about something. They just can't help themselves. They don't remain
at one level, like it's not one monotonous abstraction level. They can't
help themselves but go up and down. And so I have a list of these and I just
look at them after I've interviewed someone — after I talk to someone, I have a
particular interview that I do where I look for that. What do you do? What do
you look for now?
Sualeh Asif [45:57]:
Actually, I've been thinking about this a lot. I mean, I don't have a great
answer. I do think probably we should go back to logic puzzles where I think
there's something that comes out when you go into something like these weird
probability questions, where probability questions require something like
clarity of thought. In periods of stress, having clarity of thought is more
important than one realizes. And of course, a probability question itself, you
know, can be easily solved if you're, you know, very well versed in probability,
but also just being able to stay calm and then say, we'll reason through this
together. So for most people, it's actually a bonus. Like the whole reason to
pick a probability question is to pick for people who, you know, don't have
probability experience. It doesn't have to be probability in general. It could
be anything where you're like — I've heard of various questions where you come
up with scenarios and the scenario is something that is rather tough and the
problems get tougher and tougher over time. And it's just really interesting
because, you know, if it's very easy to produce code, probably the thing that
matters is can you clearly think through the system, and then can you clearly
articulate what you want out of the system and what you don't want out of the
system? What are the invariants you care about? So all of those things — like
back in the day, if you had to trade off between a philosopher and, you know,
someone who's a workhorse coder, you could probably pick the workhorse coder
because they had refined their skill at Next.js or something. But actually
that's become less valuable over time, so you probably want people who are just
incredibly thoughtful. I think one of the things you mentioned is just having
the person who's always one step forward and never two steps backward; that is
vastly underrated in the world. I think those
people are just very, very, very calibrated and you will always trust them on
the hardest problems in the company and that kind of skill, even though it's
hard to interview for, is pretty important.
Simon Eskildsen [48:02]:
I think probably the best way that we try to figure out if
people have clarity of thought is to ask them how they would design a system and
then see how they ratchet up complexity and navigate the trade-offs, right? That
clarity of thought of like — you're going to start with the simplest thing and
then you're going to attack it with some scale or whatever, and then we talked
about deserving complexity earlier, right? Engineers often gravitate towards
just designing the perfect system
that doesn't really have any trade-offs, but that's not one step forward, one
step forward. That's like trying to take 10 steps forward in the first step and
you're going to fail because you're going to make an assumption that's wrong. I
think the P99 navigates that ladder of complexity extremely elegantly.
Sualeh Asif [48:45]:
What is the next frontier of databases? 15 years from now, what's the next
frontier? Do you think we have the last database?
Simon Eskildsen [48:53]:
Definitely not. There's going to be hardware advances, right, in the next 10 to
15 years.
Sualeh Asif [48:58]:
GPU databases?
Simon Eskildsen [48:59]:
It might be — like people have tried that, but I don't think it's quite ready,
right? But it would make sense to me that the GPU becomes more and more general
purpose, could do more and more instructions and can move more and more
bandwidth. So that might be the next platform, right? Does it take 10 years
again? I don't know. The GPUs are evolving very fast. So I think that's one go
at it. I think it's been a pretty consistent pattern with machines — this has
happened in CPUs, it's happened in disks, and now it's happening on
object storage — that speculation wins. So basically having a very large
number of outstanding requests at once in few round trips, that's been a good
bet. And that's also how GPUs work, right? You try to give them a lot of things
to chew on at once, get it back, and then do something else conditioned on that.
So like making everything good for speculation and predictability has worked out
really well.
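A minimal Go sketch of the "many outstanding requests, few round trips" pattern; fetchBlock is a hypothetical stand-in for an object-storage or NVMe read.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchBlock stands in for a single object-storage or NVMe read (hypothetical).
func fetchBlock(id int) []byte {
	return []byte(fmt.Sprintf("block-%d", id))
}

// fetchAll issues every read at once instead of one after another, so the
// whole batch costs roughly one round trip of waiting rather than one per block.
func fetchAll(ids []int) [][]byte {
	out := make([][]byte, len(ids))
	var wg sync.WaitGroup
	sem := make(chan struct{}, 64) // cap how many requests are in flight
	for i, id := range ids {
		wg.Add(1)
		go func(i, id int) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			out[i] = fetchBlock(id)
		}(i, id)
	}
	wg.Wait()
	return out
}

func main() {
	blocks := fetchAll([]int{1, 2, 3, 4, 5})
	fmt.Println(len(blocks), "blocks fetched concurrently")
}
```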
Sualeh Asif [49:47]:
Would you be surprised if we just have the last database? We won't need more
databases in the future. I actually think — there's like a controversial opinion
— I think turbopuffer and, you know, OLAP plus Postgres combined, or like the
levels of scale that we've found to hit with them — maybe not Postgres, but
something like Vitess at massive scale — I think we've hit everything we need.
I don't think we'll need another one.
Simon Eskildsen [50:11]:
I would hesitate to say never. I would love for turbopuffer to be the last
database.
Sualeh Asif [50:16]:
We'll call it a day. Thank you so much for, you know, coming by the office. I
feel like I've learned a lot from you over time, both in, you know, architecting
great systems but also, you know, building high-performance engineering teams.
And thank you for being a great partner to Cursor also. There are, you know, so
many points of Cursor where life would have been very hard if we didn't have
Simon to rely on. And yeah, thank you so much for coming down here.
Simon Eskildsen [50:39]:
I really appreciate that. I mean, you know, we've had many conversations on the
phone about infrastructure and you have a good enough team now that you don't
call me anymore. But I do kind of miss it. And it meant a lot to me. And it
means a lot for you to say that, because you've built a great company that I've
been a very, very small part of. It really means a lot.
Sualeh Asif [50:57]:
Thank you so much.