The Data Processing Library runs complex compile patterns incrementally, whenever possible. This is based on the principle that an incremental run produces output that is identical to that of a non-incremental run; only faster. If an incremental compilation from version N to N+1 produces a different commit than just compiling version N+1 and publishing the changed payloads, this is considered an error. The
Driver may prevent incremental compilation if any the following conditions is met:
Reprocessis present in the
Jobdescription instead of
Fingerprintsof the library and your code from the last run are compared with the current ones and incremental compilation is not attempted if the fingerprints do not match.
Fingerprintsof the configuration from the last run is compared with the current ones from the active configuration and incremental compilation is not attempted if the fingerprints do not match.
It is possible for compilers to access external content, such as content not available in the input catalogs. The processing library does not and cannot block access to external data, but your compiler must make the library aware of this data.
Unless the compiler is a non-incremental compiler, the processing library must be aware of this external content. This is because, if the external content changes, it is not safe to run incrementally as some output partition may result in having content calculated from the updated externals while unchanged partition keep having content calculated from the previous externals. This may render the output catalog inconsistent, invalid, or result in unintended behavior. Therefore, when the external content changes, it is important to notify the
Driver so it runs non-incrementally once, to update all the output partitions. Subsequent runs are then incremental again.
While implementing one of the setup pattern children of the
DriverSetup interface, you can access a
DriverContext and its
Fingerprints. Using the
addCustomHash method, you register the hash of external content. If this hash is different from the one registered in the previous run, the
Driver will not run incrementally. Hashes are persisted together with
Fingerprints and checked automatically.
Not registering the hash of external content may render incremental compilation unsafe or unwanted final behavior. Registering the hash is not mandatory, as long as you are aware of the consequences.
If the content is big, you can use the
toBroadcast method in the
broadcast package. This creates a Spark broadcast in a safe manner for incremental compilation from the value you provide. The mechanisms of Spark broadcast ensures effective distribution of the content to nodes. For more details, see Broadcast Input Layers to Executor Nodes.