The Spark connector implements the standard Spark interfaces, allowing you to read data from a catalog as a DataFrame[Row] and to write a DataFrame back to a catalog.
As a result, you can use all standard Spark APIs and functions, such as select, filter, map, and collect, to work with the data.
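For example, once a layer has been read into a DataFrame, the usual Spark operations apply unchanged. The sketch below assumes a DataFrame df already obtained from the connector; the read call itself is elided because its exact form depends on your connector version and configuration, and the column names are assumptions for illustration:

```scala
import org.apache.spark.sql.functions.col

// Assume `df` was returned by the connector for a versioned layer and
// exposes a partition identifier and the decoded payload columns
// (column names here are assumptions, not the connector's contract).
val filtered = df
  .select(col("partition"), col("payload"))
  .filter(col("partition").startsWith("377"))

// Any standard Spark action works as usual:
filtered.collect()
```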
The Spark connector supports batch processing for versioned, volatile, and index layers. Structured streaming is currently not supported, so the streaming layer is not available through the Spark connector; all stream-oriented pipelines are based on the Flink connector instead. If a batch job needs to output data to the streaming layer, using the WriteEngine is recommended.
The Spark connector provides unified access to catalog data and metadata, freeing you from handling these two aspects separately. Unlike most other Spark connectors, the HERE platform Spark connector also supports the delete operation, allowing you to remove data from all supported layer types.
| Layer type | Protobuf | Avro | Parquet | Raw (octet-stream) |
|---|---|---|---|---|
| Index layer | Read, Write, Delete | Read, Write, Delete | Read, Write, Delete | Read, Write, Delete |
| Versioned layer | Read, Write | Read, Write | Read, Write | Read, Write |
| Volatile layer | Read, Write, Delete | Read, Write, Delete | Read, Write, Delete | Read, Write, Delete |
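Conceptually, the delete flow mirrors a write: you build a DataFrame identifying the records to remove and ask the connector to delete rather than upsert them. The following is a sketch only; the data source name and the option selecting the delete operation are assumptions, so consult the connector configuration for the exact spelling:

```scala
import org.apache.spark.sql.functions.{col, lit}

// Select the rows to remove from an index layer, e.g. everything
// indexed before a cutoff. `indexedAt` is an assumed index attribute.
val toDelete = df.filter(col("indexedAt") < lit(1600000000L))

toDelete.write
  .format("...")                  // connector data source (elided)
  .option("operation", "delete")  // hypothetical option name
  .save()
```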
The write operation includes create and update operations.
Protobuf, Avro, and Parquet data are automatically decoded and encoded based on the layer configuration. For Protobuf, the layer configuration must reference the associated schema; otherwise an exception is thrown. For the raw data format, you need to provide a custom decoder and encoder.
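Conceptually, a custom decoder and encoder for a raw (octet-stream) layer are just a pair of functions between the payload bytes and your record type; how you register them with the connector is part of its configuration. A minimal, self-contained sketch, assuming UTF-8 CSV payloads of the form id,value (the Record type and field names are illustrative):

```scala
import java.nio.charset.StandardCharsets

// Illustrative record type for a raw layer whose payloads are
// UTF-8 CSV lines of the form "id,value".
final case class Record(id: String, value: Double)

// Decoder: raw payload bytes -> Record
def decode(bytes: Array[Byte]): Record = {
  val Array(id, value) = new String(bytes, StandardCharsets.UTF_8).split(",", 2)
  Record(id, value.toDouble)
}

// Encoder: Record -> raw payload bytes
def encode(r: Record): Array[Byte] =
  s"${r.id},${r.value}".getBytes(StandardCharsets.UTF_8)
```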
For the Spark connector configuration, see here.