Abstract Nonsense

⚡️Apache Spark 4.0 released

Apache Spark 4.0 has been released. It’s the first major version update since Spark 3.0 in 2020.

Here are some of the highlights I’m excited about:

  • A new SQL pipe syntax. Supporting “pipe” syntax seems to be a trend among modern SQL engines now (e.g. BigQuery). I’m a fan of functional-programming-inspired design patterns and of the excellent work by the PRQL team, so I’m glad to see this next evolution of SQL play out (there’s a sketch after this list).
  • A structured logging framework. Spark logs are notoriously lengthy, and because the new framework emits JSON records you can now use Spark to consume Spark logs (second sketch below)! Coupled with improvements to stack traces in PySpark, hopefully this will mean less grepping of tortuously long stack traces.
  • A new DESCRIBE TABLE AS JSON option. I really dislike unstructured command-line outputs that you have to parse with awkward bashisms. Emitting JSON and manipulating it with jq is a far more expressive consumption pattern, one that I feel captures the spirit of command-line processing (third sketch below).
  • A new PySpark Plotting API! It’s interesting to see that it uses plotly as the backend engine. I’ll be curious to see how this plays out going forward… Being able to do #BigData ETL as well as visualisation and analytics within a single tool is a very powerful combination (fourth sketch below).
  • A new lightweight Python-only Spark Connect PyPI package. Now that Spark Connect is getting more traction, it’s nice to be able to pip install Spark on small clients without having to ship massive jars around (final sketch below).
  • A bug fix for inaccurate Decimal arithmetic. This is interesting only insofar as it reminds me that even well-established, well-tested, correctness-first, open-source software with industry backing can still be subject to really nasty correctness bugs!
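
To make the pipe syntax concrete, here’s a minimal sketch run through spark.sql. The orders table and its columns are invented for illustration, and the AGGREGATE stage reflects the pipe-syntax design as I understand it, so treat the details as assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A throwaway table to demonstrate against.
spark.createDataFrame(
    [(1, "widget", 3), (2, "widget", 5), (3, "gadget", 2)],
    ["id", "product", "quantity"],
).createOrReplaceTempView("orders")

# Each |> stage transforms the previous result, so the query reads
# top-to-bottom like a functional pipeline instead of inside-out SQL.
spark.sql("""
    FROM orders
    |> WHERE quantity > 1
    |> AGGREGATE SUM(quantity) AS total GROUP BY product
    |> ORDER BY total DESC
""").show()
```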
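
Consuming Spark’s own logs with Spark then looks roughly like this. The config key is the one I’ve seen in the docs, but the log path and the field names (ts, level, msg) are assumptions on my part:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# With spark.log.structuredLogging.enabled=true, driver/executor logs
# are emitted as JSON lines, so Spark can query them directly.
spark = SparkSession.builder.getOrCreate()

# Path and field names (ts, level, msg) are assumed for illustration.
logs = spark.read.json("/var/log/spark/app.log")

logs.filter(F.col("level") == "ERROR").select("ts", "msg").show(truncate=False)
```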
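
The DESCRIBE output can likewise be parsed straight into a Python dict rather than scraped with bashisms (from a shell, piping spark-sql output into jq achieves the same). The exact shape of the returned JSON is an assumption here, hence the key inspection at the end:

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes a table or view named `orders` exists (e.g. the one created
# in the first sketch). The result is a single row whose one column
# holds the table metadata as a JSON string.
raw = spark.sql("DESCRIBE EXTENDED orders AS JSON").first()[0]
meta = json.loads(raw)

print(meta.keys())  # inspect the structure rather than guessing key names
```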
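
The plotting API is similarly terse. As far as I can tell the plot accessor renders via plotly and hands back a plotly Figure, so the usual Figure methods apply; the data here is made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("widget", 8), ("gadget", 2)],
    ["product", "total"],
)

# plotly is the rendering engine, so this returns a plotly Figure.
fig = df.plot.bar(x="product", y="total")
fig.show()  # or fig.write_html("totals.html") for a shareable report
```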
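
Finally, the thin-client workflow. The exact PyPI package name is an assumption on my part, but the sc:// connection API below is the standard Spark Connect entry point:

```python
# On a thin client, something like `pip install pyspark-client` (package
# name assumed) gets you a pure-Python install: no JVM, no bundled jars.
from pyspark.sql import SparkSession

# sc:// is the Spark Connect scheme; 15002 is the default server port.
# The host is a placeholder.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

spark.range(5).show()  # executes on the remote cluster, not the client
```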

Databricks has some excellent coverage of the main release and of the new pipe syntax specifically.