
>Hadoop (technically a distributed file system)
I mean, if you want to get technical, he's a stuffed yellow elephant. Apache Hadoop is a project loosely encompassing at least two distributed filesystems, a resource manager, two general-purpose compute frameworks and a massive grab-bag of connectors, adapters, protocols, widgets and gubbins depended on by the rest of the industry. But that's neither here nor there.
I think the scale-up vs scale-out narrative gives away the game for MotherDuck. We haven't spent the last fifteen years scaling out just because it's fashionable, and we're under no delusions about performance. A bash `sort | uniq` is gonna run rings round a single-threaded Spark `df.distinct()` and we know that. We scale out because we have to. He's partially right that most queries don't run at 100TB scale, but I think most people would be surprised at how often that happens now. Many orgs *are* routinely slinging that kind of data volume around now because it's cheap and easy. Many more have ambitions to do so. But that's only part of it. We scale out to build in tolerance to node failure. We scale out to distribute the work over more nodes, allowing us to use smaller/cheaper machines. We scale out to get more parallel IO - most workloads are IO-bound, not CPU-bound. The list goes on.
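For the avoidance of doubt, here's roughly the comparison I mean - a sketch, not a benchmark; the file names and the single-threaded `local[1]` master are just illustrative:

```python
# The shell side of the comparison is just:  sort data.txt | uniq
# Below is the single-node Spark equivalent. Assumes `pip install pyspark`
# and a hypothetical newline-delimited file data.txt.
from pyspark.sql import SparkSession

# Pin Spark to one local thread to match the single-threaded shell pipeline.
spark = SparkSession.builder.master("local[1]").appName("distinct-demo").getOrCreate()

df = spark.read.text("data.txt")  # one string column, named "value"

# A full shuffle plus JVM startup, for what coreutils streams in one pass.
df.distinct().write.mode("overwrite").text("distinct_out/")

spark.stop()
```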
I understand why they're doing this story. They need to build a market. Otherwise this ends up being just a component in someone else's product. A shinier, sexier, analysis-oriented version of SQLite with VC backing, but ultimately just another embedded SQL execution engine. So they want to be the plucky upstarts fighting the big data orthodoxy. Make data small again, if you will. But it's a stupid narrative to build because it doesn't go anywhere. An embedded SQL engine is exactly what DuckDB is. And it's very, very flipping good at it. Lightning fast. Trivial to use. Easy to embed in anything else. I'm looking at a benchmark where it's running rings round our bespoke data processing framework, beating it by 20+% on TPC-H workloads once we've got the data onto a single node, with no config or tweaks or tuning. But therein lies the rub. We still need to get the data in and out. And that needs a product. If MotherDuck start building a data platform they're quickly going to discover how hard that is. If they don't, all they'll ever be is someone else's go-faster button.
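And when I say trivial to embed, I mean it. This is basically the whole integration surface - a minimal sketch, assuming `pip install duckdb` and a made-up `events.parquet`, not our actual benchmark setup:

```python
# Embedding DuckDB: in-process, no server, no cluster, no config.
import duckdb

con = duckdb.connect()  # in-memory database, lives inside the host process

# Query a Parquet file directly - no load step, no tuning.
top_users = con.execute("""
    SELECT user_id, count(*) AS events
    FROM 'events.parquet'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""").fetchall()

print(top_users)
```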
So because of that I'd be happy to bet they'll be acquired by a Snowflake or a Databricks at some point in the medium term. I'd assume both are poised to launch partner integrations in the next few weeks as a prelude to that and to test the waters. Which is a shame, because it means what is currently a cool, sexy, well-designed little open source widget will inevitably be borged into some Silicon Valley monster corp/cult and disappear.
I'd like to be wrong.