oi, where's my TARDIS?
"AWS S3 [...] makes changing schemas and time travel difficult."
Odd that this wasn't mentioned in the Doctor Who episode "Kerblam!"
By 2015, Netflix had completed its move from an on-premises data warehouse and analytics stack to one based around AWS S3 object storage. But the environment soon began to hit some snags. "Let me tell you a little bit about Hive tables and our love/hate relationship with them," said Ted Gooch, former database architect at the …
>[Delta Lake is a Linux Foundation project]. We contribute a lot to it, but its governance structure is in Linux Foundation.
Oh, pull the other one, Ali. Everyone and their mum knows the only reason projects are put under LF governance is because - compared to ASF - LF doesn't *have* a governance structure. It's to all intents and purposes a holding corporation for intellectual property, and projects aren't forced to act as fair or neutral brokers. So, yes, Delta Lake as you can grab it from LF is technically open source, but nobody outside of Databricks is able to contribute to the direction of the project or the governance of the project. None of the development happens in the open and none of the contributions or decisions are reviewed in the open.
More importantly for end users, the Databricks flavor of Delta Lake is always several major steps ahead of the Open Source version, with key features like Hilbert Curves for Z-ordering, Insertion-order Clustering and Tombstones/Soft Deletes exclusive to the closed source fork. It's basically shareware - you only get the full-fat experience in the Databricks Runtime. So if you're someone using multiple processing engines on the same data (i.e. everyone) you're going to have a painful time of it.
It's a shame because in many ways Delta has a nicer design than Iceberg (and is definitely way more functional and performant as things stand), but because of Databricks's allergy to proper open source governance, it'll end up an also-ran as everyone else coalesces behind Iceberg and Nessie.
Z-order etc. and more have been open sourced since the announcement: https://www.databricks.com/blog/2022/06/30/open-sourcing-all-of-delta-lake.html
Apache as a good model is also highly questionable… this is an old discussion (some important contributors have even questioned the governance of some projects that blocked new committers in order to maintain a ratio, etc.).
Regarding interests, pretty soon each vendor that praises itself as “open” will say “I have better functionality”. That’s how it works; some won’t ever contribute back…
So in the end we have two fully open source projects, which is nice.
This is actually a really good example of what I was talking about.
Databricks are telling you they're "open sourcing all of delta lake". First, that verifies what I was saying about the reality that Delta is not an open project. Databricks have done the dev in-house, without any substantial community involvement or influence. At some point commercially advantageous to them, they've kicked those features over the fence and upstream to the OSS fork. The design and development was done entirely in private at Databricks, to their requirements rather than the community's. That means the consumers of this software are beholden to Databricks for the roadmap and for which features are open sourced when, unless they're prepared to fork the project on an ongoing basis. Some are happy with that trade-off, but let's not pretend that downstream-first model is in keeping with the spirit of open source development.
The other is that they very definitely have not open sourced everything. I called out the features I mentioned specifically for good reason. Zorder is open sourced, but if you call ZORDER from Databricks you'll get a nice, efficient implementation leveraging Hilbert Curves. Call it from OSS and you'll get an implementation leveraging clunky, basic Zordering that is way less efficient. End result is your data consumers - regardless of what they are - are going to behave differently depending on whether the data writer was paying Databricks for the privilege of running in their platform at the time the data were written. The same goes for insertion-order clustering and soft deletes.
Again, most people are happy with that trade-off, but it comes with some serious disadvantages. I've little doubt Databricks will, at some point, eventually open source most/all of it, but there's equally no doubt in my mind that the approach to selecting what to open source and when is primarily motivated by doing *just* enough to play nice with Athena and Synapse and Snowflake, without giving away what they see as "their" IP. For people operating a hybrid estate across on-premises and cloud, or across a diversity of cloud vendors, that's going to be increasingly problematic as you grapple with having functionality differences at the very foundation of your data platform. Iceberg, once it catches up to Delta functionality- and performance-wise (and it will), won't come with that baggage.
That leaves less commercial advantage for those selecting Iceberg as the foundation of their services, but that's a good thing for consumers of those services. Building competitive advantage at the base, storage layer is simply a way to build lock-in. I'd much rather my suppliers compete at the business end of things making my teams more effective and efficient, rather than deep in the technical weeds building defensive architectures between one another.
Your statement on Zorder isn't true... it's not a clumsy implementation, it's the multi-dimensional one (https://github.com/delta-io/delta/issues/1134) that is good enough for most use cases.
Your pretty detailed example against Databricks is equally true if a customer starts using Iceberg in Snowflake: the proprietary Iceberg bits, optimizations etc. will be there, but they are not open sourced. I don't follow your purism either. Iceberg has a company behind it that pretty soon will offer some benefits only in its own cloud (if it doesn't already; I'm not following them directly).
Comparing LF with ASF, people forget that the most influential open source project in the history of computing, the Linux kernel, is under the Linux Foundation and NOT ASF. Some think ASF's governance prevents undue commercial interests; I can give you a long list of examples of major commercial interference in ASF projects.