Spark man Zaharia on 2.0 and why it's 'not that important' to upstage MapReduce

Spark is the open source cluster computing system started in 2009 by Matei Zaharia, when he was but an 'umble PhD candidate at Berkeley's AMPlab. Some people hope it will become the logical successor to MapReduce. Donated to the Apache Software Foundation in 2013, Spark has been backed by IBM. Proponents of Spark say it is …

  1. Anonymous Coward

    So last month's Spark APIs are obsolete already?

    Spark 1.6 came out in January. Its "Dataset" API obsoleted the experimental "DataFrame" API from last spring, which obsoleted the "RDD" API from before that.

    Now Matei is saying that Spark 2.0 will change those APIs. Again?

    FFS: give us some stability alongside the features. Say what you like about MapReduce, but old code still works, as does its (somewhat clunky) HDFS API. As for SQL, queries sitting around since 1998 still work (if you pay Oracle enough). But Spark just won't stabilise its APIs at all, chucking out new versions so fast that you're two versions behind before your project is ever ready. That's turning out to be one of the things holding us back from production: the version we're trying to use is obsolete and unsupported by the time our code is ready. We're adopting a policy of "play with it in the notebook, but not in production". Which is a shame, as the underlying system works well.

    Maybe Spark 2.0 will stop breaking code that's three months old. We'll have to wait until the summer, and Spark 2.1, to see if they're trying to do that.

    1. Anonymous Curd

      Re: So last month's Spark APIs are obsolete already?

      No, DataSet and DataFrame are distinct APIs, and neither obsoleted the RDD API.

      RDD is the low-level, nuts-and-bolts API, working directly on collections of objects, for when the higher-level APIs just don't do what you want.

      DataFrames are what they say on the tin, with the trade-off that your data must be tabular and you lose some interaction patterns (e.g. lambdas) and compile-time type safety, but get easy performance gains (i.e. transparent optimisation) for common query patterns, and easy data handling in many use cases.

      DataSets are the compromise between the two. You get the type safety and low-level, object-oriented handling of RDDs with the expressiveness and optimisation of DataFrames.
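      To make the distinction concrete, here's a minimal Spark 1.6-era sketch of the same filter expressed against all three APIs (the `Person` class, column values, and app name are invented for illustration; this needs a Spark runtime to actually execute):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      case class Person(name: String, age: Int)

      val sc = new SparkContext(new SparkConf().setAppName("api-sketch"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      // RDD: a plain distributed collection of objects; full lambda freedom,
      // compile-time type safety, but no query optimiser.
      val rdd = sc.parallelize(Seq(Person("a", 30), Person("b", 14)))
      val adultsRdd = rdd.filter(p => p.age >= 18)

      // DataFrame: tabular, untyped Rows; a typo in "age" only fails at
      // runtime, but the query runs through the Catalyst optimiser.
      val df = rdd.toDF()
      val adultsDf = df.filter(df("age") >= 18)

      // Dataset (experimental in 1.6): typed lambdas like the RDD,
      // Catalyst optimisation like the DataFrame.
      val ds = df.as[Person]
      val adultsDs = ds.filter(p => p.age >= 18)
      ```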

      And yes, 2.0 will change those APIs. Because Spark uses Semantic Versioning, and a major version bump is precisely the release where breaking changes are allowed. That's what 2.0 means. In return DataSets, which are supremely useful, will stop being experimental.
