Cloudera, MapR, IBM, and Intel bet on Spark as the new heart of Hadoop

Cloudera has rallied four major companies behind a scheme to tie two open source projects together for the benefit of the Hadoop community. The partnership between Cloudera, IBM, Intel, Databricks, and MapR to port Apache Hive onto Apache Spark is due to be announced this week at the Spark Summit in San Francisco. El Reg heard …


This topic is closed for new posts.
  1. Anonymous Coward
    Anonymous Coward

    impala out of fashion?

    Doesn't sound like a vote of support for Impala. Intel presumably care about keeping CPUs busy in every Hadoop distribution, and IBM wouldn't back Impala, given their own databases. As nobody but Cloudera works on Impala, putting people (back) onto Hive would boost their main competing tool.

    1. Anonymous Coward
      Anonymous Coward

      Re: impala out of fashion?

      They're different tools for different jobs. Impala is currently faster than Hive-on-Spark in almost all cases, and will remain so for the workloads Impala is built for, because it is written very close to the metal.

      It helps if you understand the Hive vs Impala difference as things stand. Hive is very much a big, batch tool, generally used for transformation or very expensive whole-set analysis. Impala covers a lot of those use cases but is restricted by your memory pool; in reality it is a business intelligence, column-scanning SQL engine. Pair it with Parquet and it will happily outperform Oracle/Teradata/Whatever in BI-type queries.
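      The "column-scanning SQL engine" point is the crux of why Impala pairs well with Parquet. A toy sketch in plain Python (not Impala or Parquet code; the table and figures are made up) of why a columnar layout suits BI-style aggregates:

      ```python
      # Toy illustration (plain Python, not Impala/Parquet code) of why a
      # columnar layout suits BI-style scans: an aggregate over one column
      # touches only that column's values, not every field of every row.

      rows = [  # row-oriented storage: each record kept together
          {"region": "EMEA", "product": "A", "revenue": 100},
          {"region": "APAC", "product": "B", "revenue": 250},
          {"region": "EMEA", "product": "B", "revenue": 175},
      ]

      # column-oriented storage: each field kept contiguously
      columns = {
          "region":  ["EMEA", "APAC", "EMEA"],
          "product": ["A", "B", "B"],
          "revenue": [100, 250, 175],
      }

      # Row store: the scan must deserialise whole records to reach one field.
      row_total = sum(r["revenue"] for r in rows)

      # Column store: the scan reads a single contiguous array.
      col_total = sum(columns["revenue"])

      assert row_total == col_total == 525
      ```

      On disk the difference is far larger than this sketch suggests: a columnar format only reads the bytes of the columns a query names, which is exactly the BI access pattern described above.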

      It can be used for Hive-like transformation jobs (we use it a lot for small-scale [sub-100TB] ETL/ELT), but there's a definite limit on its scalability. It isn't built for anything but low-latency, in-memory column scans and probably never will be.

      Spark on the other hand is very much MapReduce-like, but able to run entirely in memory where possible *and* more efficiently structure the jobs. Spark's native, transparent DAG makes MR's manually configured shuffle-and-sort phase look like something from the 1960s. That makes it a perfect candidate for replacing MapReduce as Hive's execution engine.
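      The shuffle-barrier point can be sketched in plain Python (neither real Spark nor real MapReduce APIs; the documents are invented) to show what the DAG buys you:

      ```python
      # Toy sketch (plain Python, not Spark/MapReduce APIs): classic MR
      # materialises a shuffle-and-sort between every map and reduce phase,
      # while a DAG engine can chain transformations in memory.

      from collections import defaultdict
      from itertools import chain

      docs = ["spark makes hive fast", "hive on spark"]

      # --- MapReduce style: explicit map, shuffle/sort, reduce phases ---
      mapped = [(word, 1) for doc in docs for word in doc.split()]  # map
      shuffled = defaultdict(list)                                  # shuffle
      for key, value in sorted(mapped):                             # sort barrier
          shuffled[key].append(value)
      mr_counts = {k: sum(vs) for k, vs in shuffled.items()}        # reduce

      # --- DAG style: one lazily chained pipeline, no forced sort barrier ---
      dag_counts = defaultdict(int)
      for word in chain.from_iterable(d.split() for d in docs):
          dag_counts[word] += 1

      assert mr_counts == dict(dag_counts)
      ```

      Both paths produce the same counts, but the MR path forces an intermediate materialisation and sort between stages, whereas the chained pipeline never leaves memory. That, in miniature, is why Spark makes a better execution engine for Hive than MapReduce.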

      We're actually seeing this across the stack. Now that Spark is 1.0 and Kerberos-friendly, the major post-MR frameworks like Cascading and Crunch are pivoting to target Spark as their prime engine. Throw in the sheer loveliness of Kite SDK and the Hadoop ecosystem is looking superb for the coming year.

      1. Anonymous Coward
        Anonymous Coward

        Re: impala out of fashion?

        You were doing so well, and then you had to go and ruin it by adding "Pair it with Parquet and it will happily outperform Oracle/Teradata/Whatever in BI-type queries."

        As you so clearly point out, Impala is predominantly a memory-based query engine, and comparing it with predominantly disk-based systems built to handle data volumes far greater than the size of memory isn't particularly useful.

        I'm sure that for some queries Impala + Parquet will beat those RDBMSs, but I'm equally sure the reverse is true too. For example, Impala pretty much dies a sorry death on large joins, which are key to many BI queries (your definition of BI queries must be extremely narrow).

        When we get to the state of recognising that we have many tools in our toolkit, and that the real challenge is getting data in front of business users as effectively as possible rather than knocking specific tools based on personal knowledge (or prejudice), we'll all be a lot better off.

        1. Anonymous Coward
          Anonymous Coward

          Re: impala out of fashion?

          You have a point, but you're wrong. I'd love to trade specific benchmarks with you, but the wonderful world of DeWitt clauses obviously prevents that. So I'll leave this with a couple of general points:

          1) "Disk-based vs memory-based"

          You're right that Impala is, in principle, an in-memory system. However, it must be said that, in general, any given Hadoop system costs a tenth (or less) of the equivalent RDBMS appliance. When you can buy your boxes by the thousand from China at ridiculously low cost, it becomes trivial to establish memory pools on the order of petabytes. This is a common use case for Impala in the enterprise, and it's one you will spend multiple tens of millions (dominated by licence and consultancy costs) trying to match with Oracle or Teradata. Yes, you're limited* to the size of your memory pool, but an Impala memory pool is almost always going to be cheaper than Oracle RAC or Teradata.

          *If you're doing truly insane workloads, there's nothing stopping you sacrificing your latency and going from millisecond responses to seconds or minutes by regressing to Hive (on MapReduce/Tez/Whatever) - same data, same hardware, same software packages, same code, all under the same licence. In those cases it's not like you'd be getting sub-second responses under Oracle anyway. Frankly, I've never encountered a customer with BI workloads big enough to run out of memory but without the spare cash to just buy another terabyte or ten of RAM.

          2) Join performance

          Impala performs as well as (and usually better than) any other system at joins. Admittedly this wasn't always the case, but that was usually down to people doing Silly Joins: trying to replicate their old data-warehouse-style workloads in Hadoop without re-engineering their data, and doing it on pre-Parquet data formats (RCFile was particularly poor at joins). Back in reality, where people do the sensible thing and thoroughly denormalise their analytical datasets, Impala always outperforms all its major open source competitors[1] and almost always outperforms An Unnamed Commercial Appliance Competitor[2] on the TPC-DS workload[3], which is join-heavy and not at all optimised for Impala.
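          What "thoroughly denormalise" buys you can be shown with a toy sketch in plain Python (not Impala SQL; the orders and regions are invented): writing the dimension attribute into the fact table up front turns every subsequent join into a plain scan.

          ```python
          # Toy sketch (plain Python, not Impala SQL) of why denormalising
          # helps analytical queries: a pre-joined flat table replaces a
          # per-row dimension lookup with a straight column scan.

          fact = [("o1", "c1", 100), ("o2", "c2", 250), ("o3", "c1", 175)]
          dim_customer = {"c1": "EMEA", "c2": "APAC"}  # dimension table

          # Normalised: every query re-joins fact rows to the dimension.
          joined = {}
          for _order, cust, amount in fact:
              region = dim_customer[cust]           # the join lookup
              joined[region] = joined.get(region, 0) + amount

          # Denormalised: region written into the fact table at load time,
          # so the same query is a single scan with no lookup at all.
          flat = [("o1", "EMEA", 100), ("o2", "APAC", 250), ("o3", "EMEA", 175)]
          denorm = {}
          for _order, region, amount in flat:
              denorm[region] = denorm.get(region, 0) + amount

          assert joined == denorm == {"EMEA": 275, "APAC": 250}
          ```

          Same answer either way; the denormalised path just pays the join cost once, at load time, instead of on every query - which is precisely the re-engineering step the Silly Joins crowd skipped.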




  2. Anonymous Coward
    Anonymous Coward

    "it's trying to throw its weight around and take more of a leading role in the Hadoop community."

    Cloudera already run the community. Everyone else is an also-ran. Frankly, the same goes for the business side of things too; Cloudera just need everyone else's name on their initiatives because it makes the CTO-who-wants-big-data more comfortable if they can ask questions like "So what do Hortonworks think about this?" and get back the answer they want.

  3. W. Anderson

    Hive and Spark for Microsoft Hadoop?

    When Microsoft adopts Free/Open Source Software (FOSS) technologies it does not and cannot control – only because those FOSS solutions are far superior to its own in-house development projects – it is always interesting to observe how it utilises and implements them. This is particularly so since projects like Hive and Spark are not developed in a Microsoft software environment, and in most cases Apache FOSS applications run substantially more effectively and faster on a FOSS Operating System (OS) and database foundation.

    Since Microsoft has adopted Hadoop as "their" standard Big Data Processing framework, will the company be updating to use Hive and Spark enabled Hadoop, and can these new Hadoop add-ons even run in a Microsoft environment?

    1. Hortonworks Comms

      Re: Hive and Spark for Microsoft Hadoop?

      start here:

    2. Steve Loughran

      Re: Hive and Spark for Microsoft Hadoop?

      /* hadoop committer stevel; employee at cloudera competitor; speaking for self only; interpret/ignore comments as you will */

      "Since Microsoft has adopted Hadoop as "their" standard Big Data Processing framework, will the company be updating to use Hive and Spark enabled Hadoop, and can these new Hadoop add-ons even run in a Microsoft environment?"

      I don't know about this new work and don't intend to comment on it directly; no point in kicking the impala while it's down.

      What I can say is that Microsoft have done a lot of work on Hive, using the skills of their SQL team to work on the query planner and execution, as well as their Dryad work, which is reflected in Tez. It all works on Windows Server and on Azure.

      There's one other thing MS have done that's interesting: Excel integration with Hive and the HCat schema service. You can point Excel at any Hadoop cluster and issue queries with it. With the speedups in Hive 13 you can get fast results on datasets far bigger than Excel has ever supported before. Given that Excel is probably the most widely used end-user data analysis tool on the planet, that's pretty sweet.

      Interestingly, IBM has been a lot less forthcoming about contributing code: I'm only aware of a few bug reports and patches related to IBM JVM compatibility, and some (immature) code to talk to the SoftLayer OpenStack storage layer. The usual "supports OSS/resists OSS" rules have changed at this layer of the stack, which is clearly a sign of a cultural shift for Microsoft.


