Comparison Testing

In the context of the Data Validation Library, a comparison test compares two catalogs. These tests fall under two key use cases:

  • comparing two versions of the same catalog
  • comparing versions of two distinct catalogs whose contents can be extracted into shared data structures

The validation library's comparison API is exposed in three packages with different levels of abstraction:

  • validation.core.comparison - provides an abstract recipe to implement comparisons for any kind of object using any kind of keys. You need to work directly with Apache Spark's RDD objects.
  • validation.core.comparison.metadiff - provides an abstract implementation of the base recipe using a combination of layer names and partition identifiers as keys. This package checks for differences by comparing the metadata checksums. For grouping and actually creating the output data, you are still required to work with RDDs.
  • validation.core.comparison.metadiff.grouped - provides two callbacks to work with pairs of differing catalog partition metadata; either layer-wise or partition-wise. Typically, this abstraction fulfills the majority of use cases.

The recommended process is to consider comparison.metadiff.grouped first, to see if it meets your requirements. Next, evaluate comparison.metadiff. If neither of these packages meets your requirements, then consider the base level, comparison.

In turn, you can use comparison.metadiff as an example of how to use comparison, and comparison.metadiff.grouped as an example of how to use comparison.metadiff.

To run a grouped comparison pipeline, refer to the quickstart-example in the SDK package.

In the sections below:

  • reference refers to the baseline catalog version
  • candidate refers to the catalog version being tested

The Comparison Package

The Comparator is the main class. It implements the inherited compile() function, which is called with the complete data of the candidate.

You must provide access to the reference's data by implementing queryReference(). As shown in the code snippet below, the compile method joins the reference and candidate data via the Joiner.join method, which you implement, and performs the actual comparison in the Comparison.compare method, which you also implement.

abstract class Comparator[K, C](joiner: Joiner[K, C], comparison: Comparison[K, C])
    extends NonIncrementalCompiler {

    // Implemented by you: supplies the complete reference data.
    def queryReference(): InData

    final override def compile(candidateData: InData, parallelism: Int)(
      implicit logContext: LogContext): ToPublish = {
        val referenceData = queryReference()
        // Argument order matches Joiner.join(referenceData, candidateData).
        val joinedData: JoinedData[K, C] = joiner.join(referenceData, candidateData)
        val results: ToPublish = comparison.compare(joinedData)
        results
    }
}

Since the following RDD declaration appears often, there is a type definition for convenience:

type JoinedData[K, C] = RDD[(K, (Option[C], Option[C]))]

The Joiner trait joins both the reference and candidate data. It is your implementation's responsibility to do this in a way that is suitable for the corresponding comparison.

trait Joiner[K, C] {
    def join(referenceData: InData, candidateData: InData): JoinedData[K, C]
}
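
As a rough illustration of what a join has to produce, the sketch below performs a full outer join on plain Scala maps, so it runs without Spark. The names JoinSketch and fullOuterJoin are hypothetical, and placing the reference value first in each pair is an assumption based on the parameter order of join; the library does not state it here. How your InData turns into keyed values is also left out, since that depends on your catalogs.

```scala
object JoinSketch {
  // Full outer join by key: every key found on either side appears exactly once.
  // Assumption: the pair is ordered (reference, candidate).
  def fullOuterJoin[K, C](
      reference: Map[K, C],
      candidate: Map[K, C]): Seq[(K, (Option[C], Option[C]))] =
    (reference.keySet ++ candidate.keySet).toSeq.map { key =>
      // A side that lacks the key contributes None.
      (key, (reference.get(key), candidate.get(key)))
    }
}
```

On keyed RDDs, Spark's own fullOuterJoin returns the same (K, (Option, Option)) shape, which is one plausible way to build a JoinedData value.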

The Comparison trait does the actual comparison of the previously joined data and returns output data that matches the output layer configuration.

trait Comparison[K, C] {
    def compare(data: JoinedData[K, C]): ToPublish
}
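
A comparison over joined data could look roughly like the sketch below. It keeps only the keys whose two sides differ and turns each into a small report record. Everything here is a simplified stand-in: Joined replaces the RDD-based JoinedData with a Seq, Finding is a hypothetical output record used in place of ToPublish, and the (reference, candidate) pair order is assumed.

```scala
object CompareSketch {
  // Stand-in for JoinedData[K, C], backed by a Seq instead of an RDD.
  type Joined[K, C] = Seq[(K, (Option[C], Option[C]))]

  // Hypothetical output record; the library's ToPublish type is not modeled.
  final case class Finding[K](key: K, message: String)

  private def show[C](value: Option[C]): String =
    value.fold("<missing>")(_.toString)

  // Emit one finding per key whose reference and candidate values differ,
  // including keys that exist on one side only.
  def compare[K, C](data: Joined[K, C]): Seq[Finding[K]] =
    data.collect {
      case (key, (ref, cand)) if ref != cand =>
        Finding(key, s"reference=${show(ref)} candidate=${show(cand)}")
    }
}
```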

The ContextHelper class queries the reference data and also provides Retrievers for the reference and the candidate catalog. These Retrievers are needed if you want to access the actual partition's content by retrieving the Payload for the given partition's metadata.

The Metadiff Package

This package implements comparison by using a LayerKey as key to join the partitions' metadata.

case class LayerKey(layer: Layer.Id, partition: Partition.Name)

The MetadataComparison class defines a retrieveResults() callback that you need to implement to handle the metadata pairs that differ in their partitions' payload checksums. Since this data remains in an RDD, you can still group it according to your output needs.

def retrieveResults(different: JoinedData[LayerKey, InMeta]): ToPublish
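
The grouping step mentioned above might be sketched as follows, using plain collections instead of an RDD. The types are simplified stand-ins and not the library's definitions: this LayerKey uses String fields in place of Layer.Id and Partition.Name, and Meta reduces InMeta to a checksum. The function name partitionsByLayer is hypothetical.

```scala
object MetadiffSketch {
  // Simplified stand-ins for the library's LayerKey and InMeta.
  final case class LayerKey(layer: String, partition: String)
  final case class Meta(checksum: String)

  // Stand-in for JoinedData[LayerKey, InMeta], already filtered to differing pairs.
  type Different = Seq[(LayerKey, (Option[Meta], Option[Meta]))]

  // Group the differing pairs by layer and list the affected partitions per layer.
  def partitionsByLayer(different: Different): Map[String, Seq[String]] =
    different
      .groupBy { case (key, _) => key.layer }
      .map { case (layer, entries) =>
        (layer, entries.map { case (key, _) => key.partition }.sorted)
      }
}
```

Because LayerKey is a case class, two keys with the same layer and partition compare as equal, which is what lets it serve as the join key.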

The Grouped Package

This package provides a quick and easy way for you to get a diff for a defined set of layers. As mentioned, this package offers two callbacks that you can use to work with pairs of differing catalog partition metadata: either layer-wise or partition-wise.

For layer-wise pairs:

def handleDiff(layer: Layer.Id,
               partitioned: Iterable[(Partition.Name, Option[InMeta], Option[InMeta])])
    : Iterable[(OutKey, Option[Payload])]
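
A layer-wise handler with this shape could, for example, write one text report per layer. The sketch below uses hypothetical stand-ins for InMeta, OutKey, and Payload, plain strings for the layer and partition names, and an output layer called "diff-report" that is invented for illustration. It also assumes each triple is ordered (partition name, reference, candidate), and it always returns Some, since this document does not describe what a None payload means.

```scala
object GroupedSketch {
  // Hypothetical stand-ins for the library's InMeta, OutKey and Payload.
  final case class Meta(checksum: String)
  final case class OutKey(layer: String, partition: String)
  final case class Payload(content: String)

  def handleDiff(layer: String,
                 partitioned: Iterable[(String, Option[Meta], Option[Meta])])
      : Iterable[(OutKey, Option[Payload])] = {
    // One line per differing partition, sorted by name for stable output.
    val lines = partitioned.toSeq.sortBy(_._1).map {
      case (name, Some(_), Some(_)) => s"$name: changed"
      case (name, None, Some(_))    => s"$name: only in candidate"
      case (name, Some(_), None)    => s"$name: only in reference"
      case (name, None, None)       => s"$name: missing on both sides"
    }
    // Publish a single report partition named after the input layer.
    Seq((OutKey("diff-report", layer), Some(Payload(lines.mkString("\n")))))
  }
}
```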
