Why Use the Data Processing Library
The Data Processing Library supports developers who write batch processing pipelines for the HERE Workspace. It provides a convenient way to interact with both the Pipeline API and the Data API through Spark, so that developers can focus on their business logic, written in Java or Scala.
A typical use case for the processing library is creating artifacts that are fed into services performing tasks such as routing to destinations, rendering digital map objects, or searching for places by name or address.
In more detail, for batch pipelines, the Data Processing Library:
- Processes versioned data by reading multiple input versioned layers from multiple input catalogs and writing the results to multiple output versioned layers in a single output catalog.
- Supports the distributed processing of partitioned catalog layers in Spark by enhancing the features of the Data Client Library. The processing library conveniently fetches the catalogs’ metadata on the master and distributes the tasks of reading and writing payload (blob) data to the nodes via RDDs. The processing library then manages the publishing of the data and performs the commit to the Data API transactionally.
- Provides a means to regularly process large partitioned data sets incrementally. In the context of maps, this is known as map compilation, and it can be used to keep maps up to date at low processing cost and time. However, this feature can be applied to any partitioned data set. It is particularly valuable for large data sets in which only a small number of partitions change from one batch processing step to the next. The Data Processing Library automatically identifies the changed partitions and processes only those. You can choose from a set of common high-level processing patterns that allow you to focus on your business logic.
All of the aforementioned components can be used independently.
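To make the idea of incremental processing more concrete, the following is a minimal sketch in plain Java. It does not use the Data Processing Library API. The class `IncrementalSketch`, the methods `changedPartitions` and `compileIncrementally`, and the representation of a layer as a map from partition ID to payload string are hypothetical and chosen for illustration only. The sketch also assumes a one-to-one mapping between input and output partitions, which is only one of several possible processing patterns.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not part of the Data Processing Library.
class IncrementalSketch {

    // Returns the IDs of partitions that were added, removed, or modified
    // between two versions of a layer. In this sketch a layer is simply a
    // map from partition ID to payload (an assumption for illustration).
    public static Set<String> changedPartitions(Map<String, String> previous,
                                                Map<String, String> current) {
        Set<String> changed = new TreeSet<>();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            // New partition, or payload differs from the previous version.
            if (!entry.getValue().equals(previous.get(entry.getKey()))) {
                changed.add(entry.getKey());
            }
        }
        for (String id : previous.keySet()) {
            // Partition existed before but is gone in the current version.
            if (!current.containsKey(id)) {
                changed.add(id);
            }
        }
        return changed;
    }

    // Rebuilds only the changed output partitions and reuses the rest of
    // the previous output unchanged.
    public static Map<String, String> compileIncrementally(
            Map<String, String> previousOutput,
            Map<String, String> currentInput,
            Set<String> changed,
            UnaryOperator<String> businessLogic) {
        Map<String, String> output = new HashMap<>(previousOutput);
        for (String id : changed) {
            String payload = currentInput.get(id);
            if (payload == null) {
                output.remove(id);                            // input partition deleted
            } else {
                output.put(id, businessLogic.apply(payload)); // recompute
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> previousInput = Map.of("a", "1", "b", "2", "c", "3");
        Map<String, String> currentInput = Map.of("a", "1", "b", "20", "d", "4");
        UnaryOperator<String> logic = payload -> "compiled(" + payload + ")";

        // Output of the previous (full) run.
        Map<String, String> previousOutput = new HashMap<>();
        previousInput.forEach((id, payload) -> previousOutput.put(id, logic.apply(payload)));

        Set<String> changed = changedPartitions(previousInput, currentInput);
        System.out.println(changed); // prints [b, c, d]

        Map<String, String> output =
                compileIncrementally(previousOutput, currentInput, changed, logic);
        System.out.println(output.get("b")); // prints compiled(20)
    }
}
```

In this example, partition `a` is unchanged, so its output is carried over from the previous run and the business logic is not invoked for it. With a large data set and few changes per run, this is where the savings in processing cost and time come from.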