Spark isn't fast anyway
Because there's too much memcpy between processes (doesn't matter whether it's Python, C#, Scala or whatever), which is why people are moving to Apache Arrow.
Good news landed today for data dabblers with a taste for .NET - Version 1.0 of .NET for Apache Spark has been released into the wild. The release was a few years in the making, with a team pulled from Azure Data engineering, the previous Mobius project, and .NET toiling away on the open-source platform. The activity was …
This is a little uninformed. Apache Arrow and Apache Spark aren't even remotely comparable. Arrow is a columnar in-memory data format (also used to exchange data between processes); Spark is a distributed computing framework.
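To make the distinction concrete, here's a rough sketch (my own illustration, assuming pyarrow is installed): Arrow just describes tabular data in memory, it doesn't schedule or execute anything.

```python
import pyarrow as pa

# Arrow is a language-agnostic, columnar in-memory format: it defines how
# tabular data is laid out so different runtimes (JVM, Python, .NET, ...)
# can hand it to each other without re-encoding it row by row.
# It runs no queries and schedules no work - that part is Spark's job.
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
print(table.schema)  # id: int64, value: double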
Spark applications running exclusively in JVM languages (Scala, Java) do not - unless directed to - perform any memory copying in the way you describe, and have never done so. All executor work happens within a single JVM. Spark applications running in non-JVM languages (R, Python, but mostly Python) historically did perform a large amount of copying between processes when running UDFs defined in those non-JVM languages, but not during "normal" operations.
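For context, this is roughly the pattern that triggers that copying - a plain row-at-a-time Python UDF (a minimal, hypothetical sketch, not code from the article):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("row-at-a-time-udf").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

# A classic Python UDF: every row is serialised (pickled) in the JVM executor,
# shipped to a separate Python worker process, evaluated there, and the result
# serialised back again - this is the inter-process copying described above.
@udf(returnType=DoubleType())
def plus_one(x):
    return float(x + 1)

df.select(plus_one("x")).show(5)
```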
That copying presented a significant performance challenge, particularly for PySpark users who are used to a Pandas-like workflow of defining a custom UDF and applying it to their DataFrame. This has been almost entirely resolved as of Apache Spark 2.3: PySpark can now exchange data between the JVM and the Python worker in a common, zero-copy columnar format rather than pickling rows back and forth, which makes the whole process far more efficient. They've done this, funnily enough, by adopting Apache Arrow.
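The Arrow-backed equivalent is the vectorised "pandas UDF" introduced in Spark 2.3: column batches cross the JVM/Python boundary as Arrow record batches instead of pickled rows. A minimal sketch (assumes PySpark 3.x for the type-hint style, plus pandas and pyarrow installed):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-udf").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

# A vectorised (pandas) UDF: whole column batches are exchanged between the
# JVM and the Python worker in Arrow format, so there is no per-row pickling.
@pandas_udf(DoubleType())
def plus_one_vec(x: pd.Series) -> pd.Series:
    return (x + 1).astype("float64")

df.select(plus_one_vec("x")).show(5)
```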