Reply to post: Re: Databricks announced the full open source of Delta

Apache Iceberg promises to change the economics of cloud-based data analytics

Anonymous Coward

Re: Databricks announced the full open source of Delta

This is actually a really good example of what I was talking about.

Databricks are telling you they're "open sourcing all of Delta Lake". First, that confirms what I was saying: Delta is not an open project. Databricks have done the dev in-house, without any substantial community involvement or influence, and at a point commercially advantageous to them they've kicked those features over the fence, upstream to the OSS fork. The design and development were done entirely in private at Databricks, to their requirements rather than the community's. That means consumers of this software are beholden to Databricks for the roadmap and for which features are open sourced when, unless they're prepared to fork the project on an ongoing basis. Some are happy with that trade-off, but let's not pretend that a downstream-first model is in keeping with the spirit of open source development.

Second, they very definitely have not open sourced everything. I called out the features I mentioned specifically for good reason. ZORDER is open sourced, but if you call ZORDER from Databricks you'll get a nice, efficient implementation leveraging Hilbert curves. Call it from OSS and you'll get clunky, basic Z-ordering that is way less efficient. The end result is that your data consumers, whatever they are, will behave differently depending on whether the data writer was paying Databricks for the privilege of running on their platform at the time the data were written. The same goes for insertion-order clustering and soft deletes.
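
For anyone who hasn't met the two curves, here's a rough textbook sketch of why the choice matters for clustering. To be clear, this is not Databricks' or Delta Lake's actual code; the function names (z_order_index, hilbert_index, max_step) and the tiny 2-D grid are my own assumptions for illustration. It maps each (x, y) cell to a position on the curve, then measures the biggest jump between cells that sit next to each other in curve order. A smaller jump means rows that are neighbours in a file are also neighbours in value space.

```python
def z_order_index(x: int, y: int, bits: int) -> int:
    """Morton (Z-order) code: interleave the bits of x and y."""
    d = 0
    for i in range(bits):
        d |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        d |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return d


def hilbert_index(x: int, y: int, bits: int) -> int:
    """Position of (x, y) along a Hilbert curve on a 2**bits square grid.

    Standard iterative xy-to-distance conversion: at each scale, pick the
    quadrant, add its offset, then rotate/flip into that quadrant's frame.
    """
    n = 1 << bits
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y   # flip
            x, y = y, x                       # rotate
        s >>= 1
    return d


def max_step(index_fn, bits: int) -> int:
    """Largest Manhattan distance between consecutive cells in curve order."""
    n = 1 << bits
    cells = sorted(
        ((x, y) for x in range(n) for y in range(n)),
        key=lambda c: index_fn(c[0], c[1], bits),
    )
    return max(
        abs(ax - bx) + abs(ay - by)
        for (ax, ay), (bx, by) in zip(cells, cells[1:])
    )


if __name__ == "__main__":
    print("Z-order max step:", max_step(z_order_index, 3))
    print("Hilbert max step:", max_step(hilbert_index, 3))
```

On a Hilbert curve every consecutive pair of cells is adjacent, whereas Z-order periodically leaps across the grid. How much that translates into file-skipping gains on real tables depends on the data and the engine, so treat it as intuition rather than a benchmark.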

Again, most people are happy with that trade-off, but it comes with some serious disadvantages. I've little doubt Databricks will, at some point, open source most or all of it, but there's equally no doubt in my mind that the approach to selecting what to open source and when is primarily motivated by doing *just* enough to play nice with Athena and Synapse and Snowflake, without giving away what they see as "their" IP. For people operating a hybrid estate across on-premises and cloud, or across a diversity of cloud vendors, that's going to be increasingly problematic as you grapple with functionality differences at the very foundation of your data platform. Iceberg, once it catches up to Delta functionality- and performance-wise (and it will), won't come with that baggage.

That leaves less commercial advantage for those selecting Iceberg as the foundation of their services, but that's a good thing for consumers of those services. Building competitive advantage at the base storage layer is simply a way to build lock-in. I'd much rather my suppliers compete at the business end of things, making my teams more effective and efficient, than deep in the technical weeds building defensive architectures against one another.
