The force is strong in Iceberg: Are the table format wars entering the final chapter?

In June, Databricks shelled out $1 billion for Tabular, a startup backer of the open source Apache Iceberg table format, signalling just how important the rather niche topic had become. It was a move which shocked the Iceberg community. There were two reasons for this. Firstly, Databricks — nominally worth $43 billion after $4 …

  1. 'arold

    anyone used iceberg?

    I'm a dinosaur, have only used parquet. Would love to hear an honest assessment of Iceberg from someone who knows what they're doing and isn't trying to sell or influence anything.

    I've always wondered what kind of latency there is on updates. And can you have atomic transactions?

    1. djnapkin

      Re: anyone used iceberg?

      "have only used parquet" - you're well ahead of me.

      From a Medium article:

      "Hive keeps track of data at the “folder” level (i.e. not the file level) and thus needs to perform file list operations when working with data in a table."

      Iceberg solves this by "keeping track of a complete list of all files within a table".

      Files within a table? Not tables within a file? It's a database, Jim, but not as I know it.
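
      For the curious, here's roughly what that file-level tracking looks like from code. A minimal sketch with PyIceberg (the catalog name and table are made up; assumes a catalog is already configured):

          from pyiceberg.catalog import load_catalog

          # Load a configured catalog and the table's current metadata.
          catalog = load_catalog("default")
          table = catalog.load_table("sales.orders")  # hypothetical table

          # Planning a scan resolves the complete file list from metadata,
          # with no directory listing against the object store.
          for task in table.scan().plan_files():
              print(task.file.file_path, task.file.record_count)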

  2. Pascal Monett Silver badge

    "something that you can be a part of, that you can control"

    Yeah, sure. Until some multi-billion dollar conglomerate comes into the game and changes the rules.

    Broadcom, anyone?

    And yeah, we're talking open source. Big deal. When big money comes in, the rules fly out the window of The Board™.

  3. J.G.Harston Silver badge

    I'm not sure I understand the issue. It should be irrelevant how *you* hold *your* data, as long as you can accept somebody else's data in somebody else's format, and send your data to somebody else in their format. Ages ago I wrote a little database program for a specific task; its native data format is something I put together to make the program coding efficient. However, it will export and import in CSV, so it's irrelevant what my format is.

    1. DanielsLateToTheParty

      Yup. Our team regularly have to deal with millions of CSV rows a minute and there's never an issue.

    2. stephenpace

      The point is you shouldn't have to move your data to be able to interoperate with it, nor is it practical at scale. We're talking petabytes of data and trillion-row tables in some cases. "Export to CSV" doesn't make sense in that context. Calculate the time and cost to export 1PB of data somewhere and you'll quickly see what I mean.

      Instead, you should be able to query your data with performance and join it in the same SQL or Python statement with data from many sources. Iceberg allows this. In that world, your “native” database is just another silo and real-time access to that data would require continual export to CSV and ingestion into something more performant. Hence why the big companies have invested in this open format.
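
      To make that concrete, a rough sketch in Python (table and file names are made up; assumes PyIceberg and DuckDB, with a catalog already configured):

          import duckdb
          from pyiceberg.catalog import load_catalog

          # Read a filtered slice of an Iceberg table straight from object storage.
          orders = (
              load_catalog("default")
              .load_table("sales.orders")
              .scan(row_filter="order_date >= '2024-01-01'")
              .to_arrow()
          )

          # Join it against a completely different source in the same statement.
          # DuckDB picks up the local Arrow table 'orders' by name.
          duckdb.sql("""
              SELECT o.customer_id, c.region, sum(o.amount) AS total
              FROM orders o
              JOIN 'customers.parquet' c USING (customer_id)
              GROUP BY o.customer_id, c.region
          """).show()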

      1. Ken Hagan Gold badge

        "Instead, you should be able to query your data with performance and join it in the same SQL or Python statement with data from many sources."

        Isn't that what a DBMS does? Publishing the internal data format sounds like a retrograde step.

        1. ariels-again

          Unfortunately the DBMS doesn't exist here, and you don't want it to exist. If I can read directly from my object store (say, S3), then my compute can soak up data as fast as S3 can emit it. For properly sharded data that can be very fast.

          An RDBMS means data has to flow through it, and availability becomes an issue. What you want is a catalog (an easy way to find your data files) and a protocol (a way to read and write your data concurrently and safely). Both Iceberg and Delta Lake provide that.
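
          In code, that catalog-plus-protocol split looks something like this (a minimal sketch, assuming PyIceberg with a configured catalog; names are made up, and the batch schema is assumed to match the table's):

              import pyarrow as pa
              from pyiceberg.catalog import load_catalog

              # The catalog is how you find the table's current metadata.
              catalog = load_catalog("default")
              table = catalog.load_table("logs.events")  # hypothetical table

              # A small Arrow batch, assumed to match the table schema.
              batch = pa.table({"id": [1, 2], "msg": ["a", "b"]})

              # The protocol: append() writes new data files to the object
              # store, then commits by atomically swapping the metadata
              # pointer in the catalog, so concurrent readers never see a
              # half-finished write.
              table.append(batch)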

          1. Ken Hagan Gold badge

            OK. So if the DBMS software has traditionally provided an abstraction layer around the data, you are now taking a gamble (*) that for a wide enough range of applications, the data will fit into a specific structure, which in turn allows a big enough performance win to be worthwhile.

            (*) Given that the DBMS problem domain has been analysed to death and beyond for half a century, perhaps "gamble" is a little harsh!

            1. stephenpace

              Databases

              Ken: Sure, but then perhaps think about it this way. You're a big company with many databases from Oracle, SQL Server, Postgres, and MySQL. Each database stores its data in a proprietary format. Oracle can't query raw SQL Server files. SQL Server can't query Postgres. But now you want to query all of that data together. What is the best way to do it? Traditionally, you've had to extract that data into yet another database that contained the superset of all the other data. Perhaps you call that a data warehouse. But what if instead you could use any compute engine you wanted to query that data in place, regardless of the original proprietary database? And you could still get ACID compliance and all of the other things you liked about databases. That's the problem an open table format like Apache Iceberg is trying to solve.
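
              To illustrate the ACID-without-a-database-server part: every committed write becomes an immutable snapshot you can list and query. A rough PyIceberg sketch (table name made up; assumes a configured catalog):

                  from pyiceberg.catalog import load_catalog

                  # Load the table through the catalog.
                  table = load_catalog("default").load_table("finance.ledger")

                  # Every committed write is an immutable snapshot in metadata.
                  for snap in table.metadata.snapshots:
                      print(snap.snapshot_id, snap.timestamp_ms, snap.summary)

                  # Time travel: read the table as of the oldest snapshot.
                  first = table.metadata.snapshots[0].snapshot_id
                  print(table.scan(snapshot_id=first).to_arrow().num_rows)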

      2. J.G.Harston Silver badge

        But that's essentially what I said! It's irrelevant how you hold your data internally, as long as others can export your stuff, and you can import others' stuff.

        1. Anonymous Coward

          When you import or export data, you're making copies of it. Copies which ultimately get out of sync with one another. A lot of datasets aren't static, so then you have the challenge of keeping all those copies of the same data in sync, just to handle different compute engines running against it.

          Having a common table format that different compute/query engines can work with means you completely sidestep a lot of the challenges that come with importing/exporting. You can query across multiple datasets directly.

  4. Michael Hoffmann Silver badge
    Pint

    This being an area I just never touched, the article could have used more info or links on why "tabular data" is such a big deal, versus anything from SQL to NoSQL databases down to your average multi-dimensional array.

    Some of the comments helped give me a very rough idea, hence the icon for them - some links from regizens would have helped, too, as I'm always happy to learn and read up.

  5. Anonymous Coward

    Lost and confused

    I personally prefer rectangular tables, but I'm not terribly picky.

    Iceberg is a type of lettuce with no nutritional value?

    1. The man with a spanner

      Re: Lost and confused

      "Iceberg is a type of lettuce with no nutrional value" and is assosiated wth Lizz Truss.

  6. khjohansen
    Coat

    Obligatory xkcd ref

    https://xkcd.com/927/

  7. John Smith 19 Gold badge
    Coat

    I smell an Oracle-buys-Java scenario

    IOW Big Corp wants total control of something whose neutral status is a big part of why people are using it in the first place.

    Or (since they have their own product) "Embrace, extend, extinguish" could be at work as well.

    Those who recognise the quote will know whom I'm thinking of.

  8. David in NL Canada

    Missed opportunity... this time.

    A bit of fun, but please look up the infamous "dickieberg" that appeared in Newfoundland, Canada a couple of years ago.

    https://www.cbc.ca/news/canada/newfoundland-labrador/oddly-shaped-iceberg-nl-1.6825578

  9. Groo The Wanderer

    Personally, I think all open source projects that unify the community and help it to move forward with actually solving problems using a new technology should be so well funded and backed. I'm awfully tired of reading about open source development teams virtually starving for the sake of ideals while corporations make millions and billions building on the technology they've "acquired" for "free" because they contribute nothing back compared to what they earn with it.

    1. Anonymous Coward

      The Strategy Has History!!

      @Groo_The_Wanderer

      So... it's been the same now for one hundred and fifty years!

      During the nineteenth century in the UK, there were many railway investment bubbles.

      People invested in the new technology....and then lost everything.

      Happened multiple times.

      At the end of this tale of woe, companies like GWR bought up railway assets for pennies... and went on to make a fortune!

      Ring a bell?

      1. Groo The Wanderer

        Re: The Strategy Has History!!

        Nope, because you're talking foreign history to me - I'm Canadian. We have CN and CP rail and a few local regional players that provide transit services or feed into the CN/CP backbones.

  10. Anonymous Coward

    Technology is irrelevant

    These systems will differ by 5%

    What matters is the skills and knowledge of the people using the tools. Do the wrong thing with any data storage system, and all the hardware and algorithm smarts can't fix it.

  11. CowHorseFrog Silver badge

    I'm confused: the title mentions format as if it were a file format, but the text seems to be about SQL querying...

    1. sworisbreathing

      You might be thinking about querying data from a single application. But think about having many, many applications, each with their own data.

      There are lots and lots of different query engines (some SQL and some not) out there, and often they'll each have their niche of workloads that they are well suited for. In a large enterprise you might have different datasets produced by different parts of your business using different tech stacks.

      If those query engines can't work with a common file format then you end up having to copy data between various systems and spending effort to keep data in sync, manage data latency, etc.

      Having a common interoperable file format cuts down on a heck of a lot of data engineering effort, meaning you can spend your time solving business problems instead of prepping data and copying it between systems/query engines.
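
      As a sketch of that interoperability (paths and names made up; assumes PyIceberg plus DuckDB's iceberg extension, and the exact path handling depends on how the table's metadata is laid out):

          import duckdb
          from pyiceberg.catalog import load_catalog

          # Engine A: PyIceberg reads the table through its catalog.
          rows = load_catalog("default").load_table("sales.orders").scan().to_arrow()

          # Engine B: DuckDB reads the very same files directly,
          # no export/import step in between.
          duckdb.sql("INSTALL iceberg")
          duckdb.sql("LOAD iceberg")
          duckdb.sql(
              "SELECT count(*) FROM iceberg_scan('/warehouse/sales/orders')"
          ).show()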

  12. PapaPepe

    A long time ago, there was a thing called a "semantic data model". It was something owned not by experts in IT, but by experts in the industry the application was meant to serve. Smart application architects did their utmost to have it documented and maintained. When data belonging to two different owners/organizations had to be merged, the presence of documented and current semantic models for each source enabled a skilled programmer to build the merge mechanism in a day or two. The absence of a semantic data model for either source ensured the merge was an endless source of trouble.

  13. HMcG

    And on a completely unrelated topic, does anyone remember HP paying $11.7 billion for Autonomy?

    That went well.
