This topic discusses some of the factors that can impact the performance of your indexing process. Because datasets have unique characteristics, you should experiment with different settings to find the optimal design for your application.
Configure your stream layer retention period to be greater than the "aggregation.window-seconds" value
Ensuring your stream layer retention period is long enough is an important design consideration for your pipeline because it correlates to the level of fault tolerance you build into your workflow. Your archiving pipeline will stream data continuously, batch data in memory based on
indexing attribute values &
aggregation.window-seconds value and then archive batched data periodically. The time necessary to archive your data will vary depending upon the configuration of your pipeline.
The recommendation is to always set the stream layer retention higher than
For example: If your aggregation-window.seconds value is 1800 (30 minutes), then you should configure your stream layer retention to at least 120 minutes. This configuration will ensure fault tolerance in case your pipeline experiences brief failure.
If your stream layer retention period is less than or equal to
aggregation.window-seconds value, data loss could occur.
The most important design consideration should be selecting the indexing attributes when creating an index layer. Note that these indexing attributes cannot be modified once an index layer is created. One way to think about indexing attributes is to consider the characteristics by which you want to query your indexed data.
For example, consider the following use case. You plan to index vehicle sensor data and you are interested in understanding different events occurring in different geographic locations at different times. In this use case, you would query your indexed data on multiple characteristics like event type, geolocation and timestamp. Therefore, you would design the index layer with following indexing attributes:
- name: eventType, type: String. Example value: fogHazard
- name: tileId, type: heretile (with desired zoom level). Example value: 95140
- name: eventTime, type: timewindow (with desired time slice duration). Example value: 1547053200000
Note that you must always include a timewindow attribute. The maximum number of additional attributes is three.
The small files problem is a well-known problem in the big data domain. The problem is that when data is broken down into a large number of small or very small files, processing them becomes very inefficient. In HERE platform indexing, the Data Archiving Library can cause this problem through excessive partitioning. In an index layer, the more attributes and attribute values there are, the more likely it is that the small files problem will manifest itself.
The size of the files in an index layer is inversely proportional to the number of partitions produced by the Data Archiving Library. This means you have to be very careful when determining the indexing approach. The potential total number of partitions is equal to the cartesian product of partitions for every attribute. This means both the number of attributes and the maximum number of unique values within every attribute should be as small as possible for a given data set. For example, using more than four attributes is discouraged unless the attributes have a limited number of values. In many cases, it is sufficient to have attributes such as timewindow (typically 1 hour to 1 day, depending on the data), location (a HERE tile of low tile level), and an additional attribute.
Take special care when determining the tile level of the location attribute. The typical approach for a map catalog, where the location is expressed as a HERE tile with a tile level of 12, is not appropriate for indexing. The number of possible values for the location attribute should be in the 1,000-10,000 range per timewindow.
Let's assume the following data stream characteristics:
- The stream rate is 1,000 messages per second
- Each message is 2 kB
- The data is aggregated in 1 hour increments (the
- The other two attributes are location (heretile) and event-type with a cardinality of 5
- Data is distributed evenly across the location and event-type attributes
So, the volume of the data per hour is:
7.2 GB (1,000 x 3,600 x 2kB = 7.2 GB)
And, the volume of the data per hour per event-type is:
1.44 GB (7.2 GB / 5 = 1.44 GB)
Ideally, individual data files should be in range of a single MB to tens of MB. This means we could have the location cardinality of 20 - 700 indicating desired tile level to be 3 - 5 (assuming a square bounding box of the location polygon). In practice, the higher the cardinality, the more workers (parallelism) are needed for the indexing process.
The following is an example configuration for SimpleUDF. This configuration works well for different aggregation window intervals like 10 mins, 30 mins and 1 hour.
avg input data size throughput: 5 GB/hour avg data size: 10k duration slice for indexing attribute of time window type: 60 mins number of event types: 8 number of tile ids: 10000 workers: ~ 12 - 15 (each worker with 1 worker unit i.e. 1 CPU - 7 GB RAM - 8 GB Disk Space)
The following is an example configuration for MultiKeysUDF:
avg input data size throughput: 5 GB/hour avg data size: 10k duration slice for indexing attribute of time window type: 60 mins number of event types: 10 number of tile ids: 6000 workers: ~ 20 - 25 (each worker with 1 worker unit i.e. 1 CPU - 7 GB RAM - 8 GB Disk Space) aggregation window interval: 30 mins expected message duplication: 3 times [because cardinality of indexing attributes is 3 (3 event types * 1 tile id * 1 ingestionTime)]
Protobuf, Avro, and Parquet are the recommended data format types for indexing data with the Data Archiving Library. Each of those formats has its own strengths and weaknesses which are summarized below.
While protobuf is a good format for data transfer or processing, it's not the best choice for data storage. It requires the presence of the schema which is external to the data files. Also problematic is the relatively weak support by the leading processing frameworks such as Apache Spark, and Flink.
Avro is a popular self-describing data format for row-oriented data storage. It has strong support for the schema evolution and represents a good fit for jobs that need to process or transform the data sequentially.
Parquet is one of the leading columnar data formats. It is a good choice for storing large volumes of data because it provides good compression characteristics due to the internal organization of the data in the files. Another strength of this format is its good fit for performing data analytics and running queries which perform data filtering and aggregation operations.
The Data Archiving Library includes an example application for Parquet which you can use to write Parquet data to an index layer. However, when you query the index layer, either using the REST APIs
blob, or using the Data Client Library, the Parquet data is returned as a raw byte array which could be complicated to process. We will provide a Spark connector in the Data Client Library in a future release to make querying Parquet data easier.