* Posts by stephenpace

2 publicly visible posts • joined 3 Oct 2024

The force is strong in Iceberg: Are the table format wars entering the final chapter?

stephenpace

Databases

Ken: Sure, but then perhaps think about it this way. You're a big company with many databases from Oracle, SQL Server, Postgres, and MySQL. Each database stores its data in a proprietary format. Oracle can't query raw SQL Server files. SQL Server can't query Postgres. But now you want to query all of that data together. What is the best way to do it? Traditionally, you've had to extract that data into yet another database that contains a superset of all of the other data. Perhaps you call that a data warehouse. But what if instead you could use any compute engine you wanted to query that data in place, regardless of the original proprietary database? And you could still get ACID compliance and all of the other things you liked about databases. That's the problem an open table format like Apache Iceberg is trying to solve.

stephenpace

The point is that you shouldn’t have to move your data to interoperate with it, nor is moving it practical at scale. We’re talking petabytes of data and trillion-row tables in some cases. “Export to CSV” doesn’t make sense in that context. Calculate the time and cost to export 1PB of data somewhere and you’ll quickly see what I mean.
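To put rough numbers on that, here is a back-of-envelope sketch in Python. The 10 Gbps sustained link and the $0.05 per GB transfer charge are assumptions picked purely for illustration, not quoted prices from any provider, and the function names are hypothetical.

```python
# Back-of-envelope estimate only. Both the link speed and the per-GB rate
# used below are assumed placeholder values, not real quoted figures.

PETABYTE = 10**15  # bytes, decimal definition


def transfer_days(num_bytes: int, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to push num_bytes over a link of link_gbps gigabits per second."""
    bits = num_bytes * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400  # seconds per day


def egress_cost_usd(num_bytes: int, usd_per_gb: float) -> float:
    """Transfer charge at a flat per-gigabyte rate (hypothetical pricing)."""
    return num_bytes / 10**9 * usd_per_gb


if __name__ == "__main__":
    days = transfer_days(PETABYTE, link_gbps=10)        # assumed 10 Gbps sustained
    cost = egress_cost_usd(PETABYTE, usd_per_gb=0.05)   # assumed $0.05 per GB
    print(f"1 PB at 10 Gbps: {days:.1f} days")
    print(f"1 PB at $0.05/GB: ${cost:,.0f}")
```

Under those assumptions a single full copy takes a little over nine days of saturated link time and costs tens of thousands of dollars in transfer charges alone, before counting the time to serialize to CSV, re-ingest on the other side, or repeat the export whenever the source changes.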

Instead, you should be able to query your data with good performance and join it in the same SQL or Python statement with data from many sources. Iceberg allows this. In that world, your “native” database is just another silo, and real-time access to that data would require continual export to CSV and ingestion into something more performant. That is why the big companies have invested in this open format.