10+ users can lead to washout: Data lakes struggle with SQL concurrency, says Gartner

Data lakes are struggling to support more than 10 users when they try to perform the SQL queries that were once seen as only fitting for data warehouse technologies, according to Gartner. Apache Spark is the most widely used processing engine when working with data lakes, because it's a single framework that can do batch …

  1. Anonymous Coward

    This is an advertorial for Databricks. A "data lake" is a (poorly defined) logical architecture, not a technology, and many organisations use it in BI applications successfully. Even more use it unsuccessfully, but such is data.

    Spark is only one of the engines in use in such architectures. MPP SQL engines such as Redshift, Presto, Hive, Impala, Snowflake and so on do the bulk of the analytical (i.e. high concurrency) work. Spark, as a batch engine, tends to do the good ol' ETL and ETL-adjacent workloads like ML.

    Databricks would like that to change (ETL is commodity), but they don't own the lake and they don't dominate it either. Their "new" "SQL Analytics" product is a lift of Apache Impala. Delta Lake is a table storage format and has little to nothing to do with the mode of access or data architecture.

    1. diodesign (Written by Reg staff) Silver badge


      "This is an advertorial for Databricks"

      No, it's not. Please don't accuse us of passing off sponsored copy as editorial -- paid-for articles are clearly marked as such.


      1. Anonymous Coward

        Re: No.

        That's fair - to be precise I should have said "this reads like an advertorial for Databricks". I in no way intended to imply the author or el reg in general are crafting content for pay or influence of any kind and apologise unreservedly for the implication.

        The general points about Gartner's and Databricks' wilful misrepresentation of the topics at hand for their commercial advantage still stand.

  2. yoganmahew

    "able to handle 19,000 queries per hour"

    Hmmm, 5 and a quarter TPS... this time next year Rodney...
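
    For anyone checking the sums, the conversion from the quoted hourly figure to a per-second rate can be sketched as below. The function name is purely illustrative, and it assumes the 19,000 queries are spread evenly across the hour, which real workloads rarely are.

```python
def queries_per_second(queries_per_hour: float) -> float:
    """Convert an hourly query count to an average per-second rate."""
    seconds_per_hour = 60 * 60
    return queries_per_hour / seconds_per_hour


# 19,000 queries per hour is the figure quoted in the comment above.
rate = queries_per_second(19_000)
print(f"{rate:.2f} queries per second")
```

    That works out to roughly 5.28 per second on average, so "5 and a quarter" is about right.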

  3. Binraider Bronze badge

    This is not a new problem. Data races and concurrency questions have hung over multi-threaded and multi-user systems since the dawn of computing.

    It's 2021, and the fact that this is still a question says a lot about the frivolous use of computing resources, because they are cheap and widely available as opposed to using them in an appropriately planned manner.

    EVE Online manages market concurrency for tens of thousands of players. The stock markets and banks have been doing rather more, and for an awful lot longer too.

    Big data and Cloud. Hah. Marketing names to re-invent '70s computing paradigms that had become unpopular. The hardware's changed but not the ideas.

    Ultimately, you need an appropriate cataloguing system or systems for your data, to ensure referential integrity - and appropriate data locking rules. If you've not got them, go back to square one and try again.

    1. TimMaher Silver badge


      Can get a bit jittery if you try and dock there. Plus shed loads of Hypernet scams slowing down the local chat feed.

Biting the hand that feeds IT © 1998–2021