Many elements go into making a pipeline. The following list defines the most important components.
application.properties – A properties file that you can reference from within the pipeline code. For more information, see the Configuration File Reference.
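A minimal sketch of what such a file might contain. The key names below are purely illustrative assumptions, not properties defined by the platform; consult the Configuration File Reference for the real entries.

```properties
# Hypothetical application.properties entries (illustrative names only)
pipeline.source.format=parquet
pipeline.max-retries=3
```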
SchedulerConfig is a property set in each Pipeline Version. The scheduler controls when a Job is created and submitted to the Flink or Spark cluster for processing. It may start a new job when the previous one completes, whether as expected or not. The scheduler polls or waits for changes from upstream catalogs, and it can also be driven by timers or other external triggers. Its properties include when to start a Job, whether to restart terminated jobs, and the polling intervals for upstream catalogs. For more information, see the Configuration File Reference.
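As a rough illustration, a scheduler section might look like the following sketch. The block syntax and every property name here are assumptions for illustration only; the authoritative names are in the Configuration File Reference.

```
// Hypothetical SchedulerConfig sketch -- property names are illustrative
scheduler {
  restart-terminated-jobs = true       // restart a job that ends unexpectedly
  upstream-polling-interval = "5m"     // how often to poll upstream catalogs
}
```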
Cluster configuration – The compute resources allocated to the pipeline:

- Number of resource units per Supervisor (Flink JobManager or Spark Driver)
- Number of resource units per Worker (Flink TaskManager or Spark Executor)
- Number of Workers (the number of Flink TaskManagers or Spark Executors)
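The three resource settings above could be expressed in a configuration fragment like the sketch below. The key names and values are hypothetical, chosen only to show how the three settings relate; the real keys are documented in the Configuration File Reference.

```
// Hypothetical cluster-resource sketch (illustrative keys and values)
supervisor-units = 2   // resource units per JobManager / Driver
worker-units     = 4   // resource units per TaskManager / Executor
workers          = 8   // number of TaskManagers / Executors
```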
Input and Output Catalogs – The Input Catalog is the data source for the pipeline; the Output Catalog is the data destination. A pipeline can have more than one input catalog, but only one output catalog.
For a Stream pipeline, you may need to specify catalog versions, depending on the type of catalog layer used (that is, versioned, volatile, or streaming).
For a Batch pipeline, you can choose to run the pipeline immediately or schedule it to run when the input catalog data is updated. To run the pipeline immediately, you must specify the catalog versions. To schedule the pipeline, you do not need to specify the catalog versions; instead, the pipeline scheduler checks the input and output catalogs every five minutes to capture changes and to verify that the versions are consistent across all the catalogs to be processed.
For example, assume a pipeline with two input catalogs that both depend on the same upstream catalog. One input catalog has incorporated changes from upstream catalog version 5, but the other has not yet processed version 5. The pipeline cannot run, because the two input catalog versions are not consistent with the upstream catalog version.
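The consistency check described above can be sketched as follows. This is an illustrative model only, not the platform's actual implementation; the catalog names, the `upstream_versions` field, and the version numbers are all hypothetical.

```python
# Hypothetical sketch of the consistency check a scheduler could perform
# before starting a scheduled batch run. All names are illustrative.

def consistent_upstream_versions(input_catalogs):
    """Return True if every input catalog that shares an upstream
    dependency has processed the same version of that dependency."""
    seen = {}  # upstream catalog name -> version first observed
    for catalog in input_catalogs:
        for upstream, version in catalog["upstream_versions"].items():
            if upstream in seen and seen[upstream] != version:
                return False  # versions diverge; do not run the pipeline
            seen[upstream] = version
    return True

# Mirrors the example above: one input catalog has processed upstream
# version 5, the other is still behind, so the run is skipped.
inputs = [
    {"name": "input-a", "upstream_versions": {"upstream-x": 5}},
    {"name": "input-b", "upstream_versions": {"upstream-x": 4}},
]
print(consistent_upstream_versions(inputs))  # -> False
```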
pipeline-config.conf – This file lists the parameters describing input catalogs, output catalog, and billing tag. The pipeline uses this information to determine whether the catalogs have changed and the scheduled batch pipeline should be run to process the changes. For more information, see the Configuration File Reference.
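A rough sketch of what pipeline-config.conf might contain. The structure, key names, and catalog identifiers below are assumptions for illustration; the actual schema is defined in the Configuration File Reference.

```
// Hypothetical pipeline-config.conf sketch (illustrative keys and values)
pipeline.config {
  billing-tag = "example-billing-tag"
  input-catalogs = [ "example-input-a", "example-input-b" ]
  output-catalog = "example-output"
}
```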