The Spark Web UI provides insight into Spark processing: jobs, stages, and execution graphs, as well as logs from the executors. The Data Processing Library components also publish various statistics (see Spark
AccumulatorV2), such as the number of metadata partitions read and the number of data bytes downloaded or uploaded. These statistics appear in the stages where the corresponding operations were performed.
For locally executed compilers, the driver launches the web UI server as part of the driver process; while the driver is running, developers can access it at http://127.0.0.1:4040/jobs. The
PipelineRunner has a handy
--no-quit option that makes it wait for an ENTER key press after the final commit before exiting, so the web UI remains available for inspection.
"Task not serializable" is the most common exception in Spark development, especially when using complex class hierarchies. Whenever a function is executed in a Spark lambda, all of the variables it refers to (its closure) are serialized and shipped to the workers. In most cases, the easiest fix is to declare the function in an object rather than in a class or inline, and to pass all required state to it as parameters.
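The fix can be sketched as follows. This is a minimal, Spark-free illustration: the `SerializationCheck` helper, the `Pipeline` class, and `PipelineFunctions` are hypothetical names invented for this example, and plain Java serialization stands in for what Spark does when it ships a closure to an executor.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical helper: attempts the same Java serialization that Spark
// performs on a closure before sending it to the workers.
object SerializationCheck {
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: java.io.NotSerializableException => false
    }
}

// Anti-pattern: the lambda calls a method of a non-serializable class,
// so it captures `this` (the whole Pipeline instance) in its closure.
class Pipeline(limit: Int) {
  def overLimit(x: Int): Boolean = x > limit
  def badClosure: Int => Boolean = x => overLimit(x)
}

// Fix: declare the function in an object and pass the required state
// (`limit`) as a parameter; the closure then captures only that value.
object PipelineFunctions {
  def overLimit(limit: Int)(x: Int): Boolean = x > limit
}
```

A closure built from `PipelineFunctions.overLimit(10)` serializes cleanly, while `new Pipeline(10).badClosure` fails the check because it drags the non-serializable `Pipeline` instance along.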
If a lambda needs non-serializable state, such as a cache, a common pattern is a lazy val in an object that each worker initializes on first access. The val should also be marked
@transient to ensure it will not be serialized via references.
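This pattern can be sketched as below. `LookupCache` and `expensiveLoad` are hypothetical names for illustration; the point is that the `@transient lazy val` field is excluded from serialization and is rebuilt lazily on each worker the first time the lambda touches it.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical holder object for per-worker state. Because the cache is
// @transient and lazy, it is never serialized with a closure; instead,
// every JVM (driver or executor) builds its own copy on first access.
object LookupCache {
  @transient lazy val cache: TrieMap[String, String] = TrieMap.empty

  def lookup(key: String): String =
    cache.getOrElseUpdate(key, expensiveLoad(key))

  // Placeholder for an expensive, non-serializable resource lookup
  // (e.g. reading from a local database or service).
  private def expensiveLoad(key: String): String = key.toUpperCase
}
```

A Spark lambda can then simply call `LookupCache.lookup(...)`; only the reference to the object travels over the wire, never the cache contents.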