This topic presents a list of important caveats developers should consider when implementing a compiler. Following these guidelines reduces the chances of poor performance or incorrect compiler behavior.
The Driver controls the distributed processing on Spark. It defines the tasks that Spark executes and is the main entry point to the processing library for developers.
To set up a Driver, developers must implement one of the children of the DriverSetup interface. This is where developers write the code that instantiates the compilers, prepares any broadcast variables, and wires everything together.
It is recommended to use a DriverBuilder for this purpose, that is, to implement the DriverSetupWithBuilder interface. Alternatively, developers can configure the driver tasks manually by implementing the DriverSetup interface directly.
To help run the pipeline, the library provides the PipelineRunner trait, which implements the Scala main method that parses the command line and supports seamless integration with the Pipeline API.
Scala developers create one Scala object that mixes in PipelineRunner and the appropriate child of DriverSetup. After implementing the abstract methods of the chosen interface, that object can be run directly from the command line, either through the Pipeline API or manually.
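As a rough illustration of this pattern, consider the following self-contained sketch. The stub traits and the configure method here are illustrative stand-ins only; the library's real PipelineRunner, DriverSetupWithBuilder, and DriverBuilder define their own abstract methods, which the developer's object must implement instead.

```scala
// Stub traits standing in for the library's types (assumptions for
// illustration; the real definitions live in the processing library).
trait DriverBuilder {
  def addTask(name: String): Unit = println(s"registered task: $name")
}
trait DriverSetupWithBuilder {
  def configure(builder: DriverBuilder): Unit
}
trait PipelineRunner { self: DriverSetupWithBuilder =>
  // The library's trait provides the main method; this stub mimics that.
  def main(args: Array[String]): Unit = configure(new DriverBuilder {})
}

// The developer-facing pattern: one object mixing in the runner and the
// chosen DriverSetup child, implementing its abstract methods.
object MyCompilerPipeline extends PipelineRunner with DriverSetupWithBuilder {
  def configure(builder: DriverBuilder): Unit =
    builder.addTask("myCompiler") // instantiate and wire compilers here
}
```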
Java developers use the PipelineRunner from the Java bindings. The current implementation of the Java bindings does not directly expose the Driver. This PipelineRunner is an abstract class with the DriverSetupWithBuilder interface already mixed in, which developers extend.
Spark relies on the determinism of the functions passed to the various RDD transformations, such as filter, map, groupBy, and reduceByKey. These functions may be applied to the same arguments multiple times, for example when a failed task is retried on another executor, when speculative execution launches duplicate copies of a task, or when an unpersisted RDD partition is recomputed.
To operate properly, Spark requires these functions to behave deterministically, meaning that when functions are applied to the same input parameters, they always return the same result.
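For instance, a pure function of its arguments meets this requirement, while a function that draws on random state does not. A minimal, library-independent sketch:

```scala
import scala.util.Random

// Deterministic: the same input always yields the same output, so Spark
// may safely re-apply it when a task is retried or run speculatively.
def stableKey(value: String): Int = value.length

// Nondeterministic: re-running the task can produce a different result,
// which breaks transformations such as map, groupBy, or reduceByKey.
def unstableKey(value: String): Int = Random.nextInt()
```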
Similarly, the Data Processing Library and incremental compilation require data processing to be deterministic: a task should produce exactly the same commit when run multiple times on the same input catalogs at the same input versions. This means that partitions produced and their payloads must be identical.
Catalogs contain checksums of the payloads. So, to upload only the payloads that have actually changed, the processing logic needs to be deterministic and produce the same output when the input has not changed.
However, many Scala containers do not promise a deterministic ordering of their elements. For example, Seq guarantees a stable element order, but containers such as Set do not. Code that processes these containers must not rely on element ordering and must produce the same result no matter the order in which elements are visited.
The solution to this challenge is implementation specific, but it usually involves either a stable sort of the container elements or a commutative, order-insensitive transformation, such as a sum.
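As a small illustration (the values and the comma-separated serialization are hypothetical), serializing a Set without fixing its order can yield different payload bytes across runs, while sorting first or using an order-insensitive aggregation keeps the output stable:

```scala
val tags: Set[String] = Set("roads", "buildings", "water")

// Risky: the iteration order of a hash-based Set is not guaranteed, so
// the serialized payload may differ between otherwise identical runs.
val unstablePayload = tags.mkString(",")

// Deterministic: impose a stable order before serializing.
val stablePayload = tags.toSeq.sorted.mkString(",")

// Deterministic: a commutative, order-insensitive aggregation.
val totalLength = tags.iterator.map(_.length).sum
```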
This applies only to RDD-based Patterns.
Executors and some compilers work at the RDD level, meaning that RDDs are passed to and returned from the functions that each executor or compiler implements. It is important to define a common policy for the persistence of these RDDs. Otherwise, Spark may throw an exception because an RDD could be persisted twice with different storage levels.
The established policy is as follows: functions may use assert, or an equivalent check, to verify that the RDDs passed to them are persisted; it is guaranteed that they will be.
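For illustration, a function honoring this policy could check the storage level of the RDD it receives before processing it. In this sketch, the function name, signature, and element types are assumptions; only the persistence check reflects the policy above:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical compiler function operating at the RDD level.
def compile(input: RDD[(String, Array[Byte])]): RDD[(String, Array[Byte])] = {
  // The policy guarantees that incoming RDDs are persisted, so assert it
  // rather than calling persist() again, which could throw an exception
  // if a different storage level were requested.
  assert(input.getStorageLevel != StorageLevel.NONE,
    "expected the input RDD to be persisted")

  input.mapValues(identity) // placeholder transformation
}
```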