Critical applications and location services that rely on collecting and processing large amounts of data need the underlying infrastructure to be reliable and responsive. Within the HERE platform, we provide various frameworks for processing the location data. The processing is done using Batch or Stream Pipelines. There are failures or exceptions that can occur with a production pipeline. To ensure that the pipelines keep working reliably and are fulfilling the business requirements, we provide several options to create and deploy highly-available pipelines. This guide helps in understanding these options and the best way to use these options to manage certain failure/exception situations.
For a pipeline, the following situations can prevent it from being highly-available:
- Stream pipeline failed and, upon restart, it reprocessed the data or some data wasn't processed
- Stream pipeline failed and did not restart
- Stream pipeline failed because a Task Manager failed
- Stream pipeline failed because the Job Manager failed
- Pipeline failed and is not reachable
- Pipeline failed or restarted without a notification
The sections below describe each situation and the solutions to manage it.
For a failed Stream pipeline to continue processing, it's important to restore it from a previous running state. Flink Checkpointing is required to create such states. Checkpointing is disabled by default. When enabled, Flink takes consistent snapshots (checkpoints) of the pipeline (or, in Flink terms, the Job graph) at specified intervals. These checkpoints are used to restore the pipeline from the last running state when restarting the pipeline. For more information, see Flink Checkpointing section of Stream Pipeline Best Practices.
By default, a failed Stream pipeline doesn't restart. To enable the pipeline to restart upon a failure, Flink supports the following Restart strategies to control how the jobs are restarted in case of a failure:
- Fixed Delay Restart Strategy The fixed delay restart strategy attempts a given number of times to restart the job. If the maximum number of attempts is exceeded, the job eventually fails.
- Failure Rate Restart Strategy The failure rate restart strategy restarts job after failure, but when
failure rate(failures per time interval) is exceeded, the job eventually fails.
- No Restart Strategy (default) The job fails directly and no restart is attempted.
If Checkpointing is not enabled, the No Restart strategy is used. For more information, see Flink Restart Strategies.
When a Flink Task Manager fails or is unreachable, the pipeline tries to get a replacement for the Task Manager. This can take several minutes and in that time the data processing will be on hold. It is also possible that the pipeline fails if the retries exceed the time or frequency thresholds before a new Task Manager is spooled and assigned to it. To prevent these situations, it is recommended to provision extra Task Managers on standby, while configuring the pipeline's resource cluster. The extra Task Managers don't process the data during normal pipeline operation and are only used when an active Task Manager fails.
For example, if a pipeline has a parallelism of 8 and it is assigned with 9 Task Managers, then only 8 Task Managers will be used for processing the data. The 9th one will be on standby until one of the 8 active Task Managers fails. To take advantage of this setup, it is recommended to set a shorter time interval within the Flink restart strategy. When one of the active Task Manager fails, the pipeline will quickly select the already available extra Task Manager to continue processing and the pipeline will not keep re-trying and waiting for a new Task Manager. For more information on Flink Parallelism, see Setting Parallelism and Cluster Configuration and Parallelism sections of the Stream Pipeline Best Practices. For more information on Restart Strategies, see the Stream pipeline failed and did not restart section above.
Note: Additional cost
Provisioning extra Task Managers incurs additional cost. For a critical pipeline, several extra Task Managers can be provisioned.
Sometimes, the Flink Job Manager of the Stream pipeline fails and with that the running job and the pipeline fails. To prevent this situation, enable the High Availability option of the Flink Job Manager. High-Availability mode reduces the downtime during processing to near-zero and is beneficial for the time-sensitive Stream data processing. To learn to configure this option, see Flink Job Manager High Availability section of Steam Pipeline Best Practices.
If a pipeline fails and is completely unreachable, it's possible that something serious has happened with the region where it was hosted. For both Batch and Stream Pipelines that require minimum downtime in case of a failure of the primary region, the pipeline can be configured with the Multi-region option so that when the primary region fails, the pipeline gets automatically transferred to the secondary region. The running pipelines will be restarted in the secondary region. The non-running pipelines get transferred as-is. When the primary region is alive again, the pipelines get transferred back to it with the same care.
Note: Additional Cost for Multi-region setup
The transfer of the pipeline state between two regions is billed as Pipeline I/O.
Note: Input and output Catalogs must be available in the secondary region
For a pipeline to work successfully in the secondary region, the input and output catalogs used by the pipeline must be available in the secondary region.
Note: Enable Checkpointing for Stream pipelines
For a Stream pipeline to successfully utilize the Multi-region option, it's important to enable Checkpointing within the code to allow Flink to take a periodic Savepoint while running the pipeline in the primary region. When the primary region fails, the last available Savepoint is used to restart the pipeline in the secondary region. For more information on Checkpointing, see the Stream pipeline failed and, upon restart, it reprocessed the data or some data wasn't processed section above.
Note: Spark History transfer for Batch pipelines
For a Batch pipeline, the Spark History details are also transferred during a region failure.
Note: Activation of On-demand Batch pipelines
If the primary region fails, an on-demand Batch pipeline in the Running state is automaticaly re-activated in the secondary region. However, a failed on-demand Batch pipeline needs to be manually re-activated.
If a pipeline is restarted by HERE Platform without any notification, it can happen at a time that is not convenient for you.
For a Stream pipeline, it is recommended to provide an email address in the email field so that if the pipeline is restarted by HERE Platform for some reason like a planned outage, an email notification is sent to the assigned email address in order to allow you to take an appropriate action at a time that is convenient for you.
To learn more, see Enable Notification and Recovery of Stream Pipelines
Pipelines generate certain standard metrics that can be used to track their status over time. To monitor the status of a pipeline, a Pipeline Status dashboard is available in Grafana. Based on that dashboard, you can set failure alerts that can be sent via email or other mediums. Furthermore, Spark UI for Batch pipelines and Flink Dashboard for Stream pipelines is available in the platform and helpful in assessing a pipeline's performance for improvement. Finally, it's also possible to create custom metrics for the pipelines to monitor and get alerted for certain expected output. For more information, see Pipeline Monitoring section and the Logs, Monitoring, and Alerts User Guide.