⚡️Apache Spark 4.0 released
Apache Spark 4.0 has been released. It’s the first major version update since Spark 3.0 in 2020.
Here are some of the highlights I’m excited about:
- A new SQL pipe syntax. Supporting a “pipe” syntax seems to be a trend among modern SQL engines now (e.g. BigQuery). I’m a fan of design patterns inspired by functional programming, and of the excellent work by the prql team, so I’m glad to see this next evolution of SQL play out (see the first sketch after this list).
- A structured logging framework. Spark logs are notoriously lengthy, and this means you can now use Spark to consume Spark logs! Coupled with improvements to stack traces in PySpark, hopefully this will mean less `grep`ping through tortuously long stack traces (sketch below).
- A new `DESCRIBE TABLE AS JSON` option. I really dislike unstructured command line outputs that you have to parse with `awk`ward bashisms. JSON input/output and manipulation with `jq` is a far more expressive consumption pattern, one that I feel captures the spirit of command line processing (sketch below).
- A new PySpark Plotting API! It’s interesting to see it supports plotly on the backend as an engine. I’ll be curious to see how this plays out going forward… Being able to do #BigData ETL as well as visualisation and analytics within the one tool is a very powerful combination (sketch below).
- A new lightweight Python-only Spark Connect PyPI package. Now that Spark Connect is getting more traction, it’s nice to be able to `pip install` Spark on small clients without having to ship massive jars around (sketch below).
- A bug fix for inaccurate Decimal arithmetic. This is interesting only insofar as it reminds me that even well-established, well-tested, correctness-first, open-source software with industry backing can still be subject to really nasty correctness bugs!
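A few rough sketches of what these look like in practice. First, the pipe syntax: this assumes a hypothetical `orders` table with `region`, `amount`, and `order_date` columns. Each `|>` stage transforms the output of the previous one, so the query reads top-to-bottom instead of inside-out:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pipe-syntax query: each |> stage consumes the previous stage's result.
spark.sql("""
    FROM orders
    |> WHERE order_date >= DATE '2024-01-01'
    |> AGGREGATE SUM(amount) AS total GROUP BY region
    |> ORDER BY total DESC
    |> LIMIT 5
""").show()
```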
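For the structured logs, setting `spark.log.structuredLogging.enabled=true` makes Spark emit its logs as JSON lines, which load like any other dataset. The log path and the field names below (`ts`, `level`, `msg`) are assumptions about your deployment’s layout, so adjust to taste:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Structured logs are JSON lines, so spark.read.json can infer a schema.
# "/var/log/spark/app.log" is a placeholder for wherever your logs land.
logs = spark.read.json("/var/log/spark/app.log")

# Spark consuming Spark's own logs: filter straight to the errors.
logs.filter(F.col("level") == "ERROR").select("ts", "msg").show(truncate=False)
```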
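The `DESCRIBE ... AS JSON` output lands in a single JSON document, so you can parse it with `json.loads` in Python (or pipe `spark-sql` output through `jq` at the command line). Here `my_table` is a placeholder and the exact metadata keys (`table_name`, `columns`) are an assumption on my part:

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# AS JSON returns the metadata as one JSON document instead of
# free-form text rows that need awk-ward parsing.
raw = spark.sql("DESCRIBE EXTENDED my_table AS JSON").first()[0]
meta = json.loads(raw)

print(meta["table_name"])
print([col["name"] for col in meta["columns"]])
```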
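The plotting API hangs off DataFrames directly, pandas-style, and hands back a plotly figure. A minimal sketch with made-up data (plotly needs to be installed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2024-01", 120), ("2024-02", 135), ("2024-03", 128)],
    ["month", "active_users"],
)

# df.plot renders via the plotly backend and returns a plotly Figure,
# so the usual fig.show() / fig.write_html() workflow applies.
fig = df.plot.line(x="month", y="active_users")
fig.show()
```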
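And the lightweight client: `pip install pyspark-client` pulls in just the Python-only Spark Connect client, and `SparkSession.builder.remote` points it at a running Spark Connect server. The `sc://` endpoint below is a placeholder (15002 is the default port):

```python
from pyspark.sql import SparkSession

# Connects to a remote Spark Connect server; no JVM or jars needed locally.
spark = SparkSession.builder.remote("sc://my-cluster:15002").getOrCreate()

spark.range(10).selectExpr("sum(id) AS total").show()
```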
Databricks has some excellent coverage of the main release and of the new pipe syntax specifically.