In the HERE Workspace, a pipeline is an executable data processing application designed to run on one of the streaming or batch processing frameworks supported by the Pipeline API. The application is commonly referred to as a pipeline because it channels input data through a defined sequence of manipulations and transformations toward a single end point. The pipeline application is compiled into a fat JAR file for distribution and execution in the operational environment provided by the Pipeline API. It has two basic components: the framework interface and the data processing workflow. The data processing workflow consists of the hard-coded data transformation algorithms that make each pipeline unique, especially when it uses Workspace libraries and resources. These specialized algorithms transform the input data into a useful (productized) form in the output. Execution can occur in a streaming environment based on Apache Flink or in a batch environment based on Apache Spark, but the application must be preconfigured exclusively for one of these environments, never both. Every pipeline must have at least one data source (catalog) and exactly one data sink (catalog), both external to the pipeline itself, as shown below.
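The "single sequence of transformations with one end point" shape described above can be sketched in plain Java. This is an illustrative model only; the class and method names are invented for this sketch and are not part of the HERE Data SDK:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative sketch only: models the "input -> chained transformations
// -> single output" shape of a pipeline. All names here are hypothetical
// and do not belong to the HERE Data SDK or Pipeline API.
public class PipelineSketch {
    // Two example transformation steps in the data processing workflow.
    static final Function<String, String> NORMALIZE = s -> s.trim().toLowerCase();
    static final Function<String, String> ENRICH = s -> s + ":enriched";

    // Channel each input record through the fixed sequence of transformations.
    public static List<String> run(List<String> source) {
        return source.stream()
                .map(NORMALIZE.andThen(ENRICH))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // → [berlin:enriched, munich:enriched]
        System.out.println(run(List.of("  Berlin ", "MUNICH")));
    }
}
```

In a real pipeline, the framework interface (Flink or Spark) supplies the execution model around such a transformation chain, while the chain itself is the data processing workflow that makes the pipeline unique.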
The Framework Interface, Data Ingestion, and Data Output are all artifacts of the data being processed and the selected processing framework. They can be considered the basic components required for the pipeline to run within that framework. Maven archetypes supplied with the HERE Data SDK for Java & Scala predefine the basics of a pipeline development project. The data processing workflow is what distinguishes a pipeline: it defines the specific manipulations and transformations applied to the input from the Data Source, with the results written to the Data Sink for temporary storage. The workflow executes a unique set of business rules and algorithmic transformations on the data. Run-time considerations are typically addressed as a set of configuration parameters applied to the pipeline when it is deployed and a job is initiated.
HERE pipelines follow the classic software pipeline pattern, in which a series of data processing elements is encapsulated into a reusable pipeline component. Each pipeline processes input data in streams or batches and outputs the results to a specified destination catalog.
- Pipelines can contain any combination of data processing algorithms in a reusable JAR file package.
- Pipelines can be built in Java or Scala from a standard pipeline application template.
- Pipelines are compiled and distributed as fat JAR files, for ease of management, deployment and use.
- Pipelines can be highly specialized or very flexible, based on how the data processing workflow is designed.
- Pipelines are deployed with a set of runtime parameters that allow as much pipeline flexibility as needed.
- The Pipeline Service uses standard framework interfaces for processing streaming data (Apache Flink 1.7.1) and for batched data (Apache Spark 2.4.1).
- Pipelines can be chained by using the output catalog of one pipeline as the input catalog of another pipeline, although there are no tools included in the Workspace specifically for implementing such a scheme at this time.
- Pipelines can be deployed and managed from (1) the Command Line Interface (the OLP CLI), (2) through the portal, or (3) by a custom user application using the pipeline service REST API.
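The chaining point above can be sketched as function composition: one pipeline's output catalog serves as the next pipeline's input catalog. This is a conceptual sketch only; the names are invented and the Workspace provides no dedicated chaining API at this time:

```java
import java.util.function.UnaryOperator;

// Conceptual sketch of pipeline chaining. Each "pipeline" is modeled as a
// function from an input catalog name to an output catalog name; these
// names are illustrative, not the HERE Pipeline API.
public class ChainSketch {
    static final UnaryOperator<String> PIPELINE_A = catalog -> catalog + "->A";
    static final UnaryOperator<String> PIPELINE_B = catalog -> catalog + "->B";

    static String chain(String inputCatalog) {
        // Pipeline B reads the catalog that pipeline A wrote.
        return PIPELINE_B.apply(PIPELINE_A.apply(inputCatalog));
    }

    public static void main(String[] args) {
        System.out.println(chain("input-catalog")); // input-catalog->A->B
    }
}
```

In practice, the chaining happens through catalog configuration at deployment time rather than in application code: pipeline B is simply deployed with pipeline A's output catalog as its data source.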
Developing a pipeline in the HERE Workspace is the process of creating a pipeline JAR file.
Pipelines go through a design and implementation process before they can be used operationally. Once designed and implemented, a pipeline exists as an executable JAR file that the Workspace can run as needed. Because each pipeline JAR file can be used in either a streaming or a batch execution environment, but not both, its design must accommodate the requirements and restrictions of the selected environment. If a streaming environment is selected, the JAR file must be executable on the Apache Flink framework embedded in the pipeline; if a batch environment is selected, it must be executable on the embedded Apache Spark framework. Flink and Spark each impose specific requirements on their JAR file designs.
A new pipeline is typically created using the following process:
- A functional objective or business goal is defined for the pipeline, typically defining a basic data workflow.
- A set of algorithms for manipulating the data to achieve the objective is developed.
- Those algorithms must be implemented in either Java or Scala to be compatible with the pipeline templates.
- The implemented algorithms are then integrated into a pipeline application targeting either a streaming or a batch processing model. Maven archetypes are used to build the pipeline project.
- Any runtime parameters required by the implemented algorithms are defined during the integration process.
- A fat JAR file is created and tested.
- The fat JAR file, with any associated libraries or other assets, is the pipeline deliverable that is deployed to the operational environment.
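The runtime-parameter step in the process above can be sketched as follows. The property name and the reading mechanism here are invented for illustration; real deployments pass configuration through the Pipeline API rather than a `Properties` object:

```java
import java.util.Properties;

// Sketch of how runtime parameters can parameterize a workflow algorithm
// without recompiling the fat JAR. The property name "scale.factor" and
// this mechanism are hypothetical, not the HERE Pipeline API.
public class RuntimeParams {
    // An example algorithm whose behavior is tuned by a deployment-time
    // parameter; it falls back to a neutral default when the parameter
    // is absent.
    public static int scale(int value, Properties config) {
        int factor = Integer.parseInt(config.getProperty("scale.factor", "1"));
        return value * factor;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("scale.factor", "3"); // supplied at deployment
        System.out.println(scale(7, config));    // 21
    }
}
```

Defining such parameters during integration is what lets one compiled JAR serve multiple deployments with different behavior.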
An operational requirement will describe the individual pipelines and catalogs to be used and the sequencing of their execution. This is the unique topology that must be deployed, and it can include as few as one pipeline stage, or as many as the computing environment can support. Because of this flexibility, pipelines can be designed for reuse, or not, as needed.
Keep in mind that pipelines do not operate in a vacuum. Every pipeline has a data source and an output catalog that holds the data the pipeline processes, and that output catalog must be compatible with the data transformations done in the pipeline. The range of possible variations in input and output catalogs can be seen in:
Operationally, pipelines are deployed and monitored as they run. The deployment process includes definitions of data sources and data destinations, thus implementing the more complex topologies described in a specific operational requirement. As the pipelines execute, their activity is monitored and logged for later analysis, if required. Additional tools can be used to generate alerts based on events during data processing.
Deployed pipelines begin processing data when an activate command is issued for a specific Pipeline Version ID. Only a configured Pipeline Version can process data on the HERE Workspace. The following operational commands can be directed to any running pipeline:
- Pause - freezes pipeline operation until a resume command is issued.
- Resume - restarts a paused pipeline from the point where execution was paused.
- Cancel - terminates an executing pipeline, stopping it from processing any further data.
- Upgrade - replaces a running pipeline version with another pipeline version.
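The lifecycle implied by these commands can be sketched as a simple state machine. The states and transitions below are a simplification for illustration, not the Pipeline API's actual state model:

```java
// Simplified sketch of a pipeline version's lifecycle under the
// operational commands described above. This is an illustration only;
// the real Pipeline API defines its own states and transition rules.
public class LifecycleSketch {
    enum State { RUNNING, PAUSED, CANCELED }

    static State apply(State current, String command) {
        switch (command) {
            case "pause":   return current == State.RUNNING ? State.PAUSED : current;
            case "resume":  return current == State.PAUSED ? State.RUNNING : current;
            case "cancel":  return State.CANCELED;
            case "upgrade": return current; // a new version takes over; state carries on
            default: throw new IllegalArgumentException("unknown command: " + command);
        }
    }

    public static void main(String[] args) {
        State s = State.RUNNING;
        s = apply(s, "pause");
        s = apply(s, "resume");
        System.out.println(s); // RUNNING
    }
}
```

Modeling the commands this way makes one property of the list explicit: pause and resume are inverses, while cancel is terminal.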
If you are a developer and you want to start using the HERE Workspace, note the following:
The Workspace is designed to build distributed processing pipelines for location-related data. Your data is stored in catalogs. Processing is done in pipelines, which are applications written in Java or Scala and run on an Apache Spark or Apache Flink framework.
The Workspace abstracts away the management, provisioning, scaling, configuration, operation, and maintenance of server-side components and lets you focus purely on the logic of the application and the data required to develop a use case. Thus, the Workspace is mainly aimed at developers doing data processing, compilation, visualization, and similar work, but it also allows data analysts to do ad-hoc development.
At this point, go to: