This topic lists important caveats that developers should consider when implementing a compiler. Following these guidelines reduces the risk of poor performance or incorrect compiler behavior.
The Driver controls the distributed processing on Spark. It defines the tasks that Spark executes and is the developers' main entry point to the processing library.
To set up a Driver, developers must implement one of the children of the DriverSetup interface. This is where the code that instantiates the compilers, prepares any broadcast variables, and wires everything together belongs.
It is recommended to use a DriverBuilder for this purpose by implementing the DriverSetupWithBuilder interface. Alternatively, developers can configure the driver tasks manually by implementing
To help run the pipeline, the library provides the PipelineRunner trait, which implements the Scala main method that parses the command line and supports seamless integration with the Pipeline API.
Scala developers create one Scala object that mixes in PipelineRunner and the appropriate child of DriverSetup. Once the abstract methods inherited from the chosen interface are implemented, that object can be run directly from the command line, either by the Pipeline API or manually.
Java developers use the PipelineRunner from the Java bindings. The current implementation does not directly expose the Driver. It is an abstract class with the DriverSetupWithBuilder interface already mixed in, which developers implement.
Spark relies on the determinism of the functions passed to the various RDD transformations, such as filter, map, groupBy, reduceByKey, and so on. These functions may be applied to the same arguments more than once, for example:
- when a task fails and is retried
- when the same RDD partition is calculated more than once by the task due to lack of persistence or because a previously calculated RDD partition was removed from the cache
To operate properly, Spark requires these functions to behave deterministically, meaning that when functions are applied to the same input parameters, they always return the same result.
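The requirement above can be illustrated without Spark. The following is a minimal sketch that simulates recomputing a partition by applying the same function to the same input twice. The names DeterminismSketch, randomTag, stableTag, and stableUnderRecompute are hypothetical and are not part of Spark or of the processing library.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.UUID
import scala.util.Random

// Hypothetical helper names, used for illustration only.
object DeterminismSketch {
  // Non-deterministic: the result depends on hidden state (an unseeded RNG),
  // so a retried or recomputed task can yield a different value for the same input.
  def randomTag(name: String): String = s"$name-${Random.nextLong()}"

  // Deterministic: the identifier is derived only from the input bytes.
  def stableTag(name: String): String =
    s"$name-${UUID.nameUUIDFromBytes(name.getBytes(UTF_8))}"

  // Simulates a recomputation: applies f to the same partition twice
  // and reports whether both attempts produced the same result.
  def stableUnderRecompute[A, B](partition: Seq[A])(f: A => B): Boolean =
    partition.map(f) == partition.map(f)
}
```

A function like stableTag is safe to pass to map, while randomTag is not: any value that must look random should be derived from the input rather than drawn at call time.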
Similarly, the Data Processing Library and incremental compilation require data processing to be deterministic: a task should produce exactly the same commit when run multiple times on the same input catalogs at the same input versions. This means that partitions produced and their payloads must be identical.
Catalogs contain checksums of the payloads. So, to properly upload only payloads that have changed, the processing logic needs to be deterministic and produce the same output if the input did not change.
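To show why identical payloads matter, here is a rough sketch of checksum-based change detection. It assumes SHA-256 as the checksum and a plain map from partition key to payload; both are assumptions for illustration, and ChecksumSketch and its methods are hypothetical names, not the library's implementation.

```scala
import java.security.MessageDigest

// Hypothetical sketch; the real catalog checksum algorithm may differ.
object ChecksumSketch {
  // Hex-encoded SHA-256 digest of a payload.
  def checksum(payload: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256").digest(payload).map(b => f"$b%02x").mkString

  // Keys of partitions that are new or whose payload checksum differs from
  // the previously recorded one, returned in sorted order.
  def changedPartitions(
      previous: Map[String, String],
      current: Map[String, Array[Byte]]
  ): Seq[String] =
    current.collect {
      case (key, payload) if !previous.get(key).contains(checksum(payload)) => key
    }.toSeq.sorted
}
```

If the processing logic emitted the same logical content with a different byte layout on each run, every partition would appear in changedPartitions even though the input did not change.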
However, many Scala containers do not promise a deterministic ordering of their elements. For example, although Seq does promise a deterministic order, containers such as Set do not. Code that processes these containers should not rely on the ordering of elements, so that it produces the same result no matter the order.
The solution to this challenge is implementation specific, but usually involves sorting the container elements into a stable order or applying a commutative transformation.
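Both approaches can be sketched with the Scala standard library alone. The object and method names below are hypothetical, and the choice of lexicographic sorting and integer addition is only one possible illustration.

```scala
// Hypothetical helper names, used for illustration only.
object OrderingSketch {
  // Order-dependent: the result follows the Set's iteration order,
  // which is not part of the Set contract.
  def joinUnordered(ids: Set[String]): String = ids.mkString(",")

  // Order-independent: sorting first makes the result depend only on the contents.
  def joinSorted(ids: Set[String]): String = ids.toSeq.sorted.mkString(",")

  // Order-independent: integer addition is commutative and associative.
  // The iterator avoids Set.map, which would collapse equal lengths into one element.
  def totalLength(ids: Set[String]): Int = ids.iterator.map(_.length).sum
}
```

For example, OrderingSketch.joinSorted(Set("b", "c", "a")) yields "a,b,c" regardless of how the set was built, whereas the output of joinUnordered may vary with the set implementation and size.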
This applies only to RDD-based Patterns.
Executors and some compilers work at the RDD level, meaning that RDDs are passed back and forth from the functions that each executor or compiler implements. It is important to define a common policy regarding persistence of the RDDs being passed and returned. Otherwise, there is a risk of Spark throwing an exception because some RDDs may be persisted twice with different storage levels.
The established policy is as follows:
- RDDs that are passed to each execute function are either already persisted by the library or otherwise guaranteed to be reusable multiple times efficiently. Implementations shall not persist them. Implementations also shall not assert that the RDDs passed are persisted, even though they are guaranteed to be persisted or equivalently reusable.
- RDDs that are returned by each execute function do not have to be persisted. Implementations may persist them if that is useful. The processing library may persist the returned RDDs if they are not already persisted.
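The policy above might look as follows in code. This is a sketch only: it assumes Apache Spark is on the classpath, and the execute signature and the persistIfNeeded helper are hypothetical, not the processing library's API. It relies on the standard Spark calls persist and getStorageLevel. Spark rejects an attempt to change the storage level of an RDD that already has one, which is why the caller-side helper checks first.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical names; only RDD.persist, RDD.getStorageLevel and
// StorageLevel are real Spark APIs here.
object PersistencePolicySketch {
  // An execute-style function: it neither persists its input nor asserts
  // on the input's storage level, because the caller guarantees reusability.
  def execute(input: RDD[Int]): RDD[Int] = {
    val doubled = input.map(_ * 2)
    // Optional: persist the result only if this implementation benefits from it.
    doubled.persist(StorageLevel.MEMORY_AND_DISK)
  }

  // What a caller could do with a returned RDD: persist it only when
  // no storage level has been assigned yet.
  def persistIfNeeded[T](rdd: RDD[T], level: StorageLevel): RDD[T] =
    if (rdd.getStorageLevel == StorageLevel.NONE) rdd.persist(level) else rdd
}
```

With this arrangement each RDD is assigned a storage level at most once, whichever side chooses to persist it.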