Build Your Own Docker

Data SDK for Python can also be installed using a Dockerfile, a text document that contains all the commands a user could call on the command line to assemble an image. With docker build, you can create an automated build that executes several command-line instructions in succession.

Prerequisites

Setup Files

To begin, sign in to the HERE platform. This renews your browser token so that you can access the HERE platform repository. Next, download the Docker archive, unzip it, and open a terminal in the unzipped folder:

For Linux/MacOS:

unzip docker-files.zip
cd docker-files/

For Windows:

cd docker-files\

Note

This software has Open Source Software dependencies, which will be downloaded and installed upon execution of the installation commands. See Dockerfile which is part of the zip file.

Copy the credential files (credentials.properties, hls_credentials.properties, and settings.xml) into the current directory:

For Linux/MacOS:

cp ~/.here/credentials.properties .
cp ~/.here/hls_credentials.properties .
cp ~/.m2/settings.xml .

For Windows:

copy %USERPROFILE%\.here\credentials.properties .
copy %USERPROFILE%\.here\hls_credentials.properties .
copy %USERPROFILE%\.m2\settings.xml .
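
Before building, it can help to confirm that the three files actually landed in the build directory. The following is a minimal sketch, not part of the SDK; the helper name missing_credential_files is hypothetical, and the file names are taken from the copy commands above.

```python
from pathlib import Path

# File names taken from the copy commands above.
REQUIRED_FILES = (
    "credentials.properties",
    "hls_credentials.properties",
    "settings.xml",
)


def missing_credential_files(build_dir="."):
    """Return the required file names that are absent from build_dir."""
    base = Path(build_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


missing = missing_credential_files(".")
if missing:
    print("Missing before docker build:", ", ".join(missing))
else:
    print("All credential files are in place.")
```

Run it from the unzipped docker-files folder; it works the same way on Linux, macOS, and Windows because it relies only on pathlib.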

Update Configuration File

The Sparkmagic configuration file (spark-conf-files.zip) includes Data SDK jars for version 2.11.7. The latest version of the Data SDK jars can be identified using this link in the Include BOMs sub-section. To use the latest Data SDK jars, run the script config_file_updater.py with the following command:

python config_file_updater.py --version <version_to_upgrade_to>

Note

  • This script requires Python 3.7+ on your local machine.

Build Image

Build the Docker image:

docker build -t olp-sdk-for-python-1.8 --rm .

Note

  • The default Docker image name is olp-sdk-for-python-1.8. To create or update an image under a different name, specify the name in the command:

       docker build -t <yourimagename> --rm .
    

Execute Image

To execute the image in a container:

docker run -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.8

Note

  • Once you exit/restart the container, all your changes are lost. To retain the changes, the most common way is to use a Docker volume mount to mount another directory into your container:

       docker run -v <host_src>:<container_directory_to_mount> -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.8
    
  • Example that retains the Ivy cache jars for local Spark:

    For Linux:

       docker run -v ~/.ivy2:/home/here/.ivy2 -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.8
    

    For Windows:

       docker run -v %USERPROFILE%\.ivy2:/home/here/.ivy2 -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.8
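
The run commands above differ only in the volume mounts and the optional trailing command. As an illustration, the sketch below assembles the same argument list in Python so that the host path resolves on any operating system. The function name docker_run_command is hypothetical; the image name, ports, and container path are the ones used in the commands above.

```python
from pathlib import Path

IMAGE = "olp-sdk-for-python-1.8"
PORTS = (8080, 8998)  # Jupyter and Livy ports, as published in the commands above


def docker_run_command(volumes=None, image=IMAGE, command=None):
    """Build the argv list for `docker run` with optional volume mounts.

    volumes maps a host directory to a directory inside the container.
    """
    argv = ["docker", "run"]
    for host_dir, container_dir in (volumes or {}).items():
        argv += ["-v", f"{host_dir}:{container_dir}"]
    for port in PORTS:
        argv += ["-p", f"{port}:{port}"]
    argv += ["-it", image]
    if command:
        argv.append(command)
    return argv


# Keep the Ivy cache between runs; Path.home() resolves on Linux, macOS, and Windows.
ivy_cache = Path.home() / ".ivy2"
print(" ".join(docker_run_command({str(ivy_cache): "/home/here/.ivy2"})))
```

To start the container from Python, the returned list could be passed to subprocess.run(argv, check=True) from a real terminal, since -it needs one.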
    

Tutorial Notebooks

Open the Jupyter URL printed in the container output in your browser and execute the sample notebooks.

The tutorial notebooks included with the SDK are located in the folder:

$HOME/olp-sdk-for-python-1.8/tutorial-notebooks/python

We recommend reading the Getting Started notebook to get an overview of all of the tutorial notebooks:

$HOME/olp-sdk-for-python-1.8/tutorial-notebooks/GettingStarted.ipynb

API Reference

Explore the Data SDK for Python API reference by opening the HTML docs located at:

$HOME/olp-sdk-for-python-1.8/documentation/Data SDK for Python API Reference.html

Note

We recommend opening this documentation directly in Chrome or Firefox rather than in Jupyter or Internet Explorer.

Customized Execution of Image

To execute the image in a container, use this command:

docker run -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.8 /bin/bash

Note

  • Once you exit/restart the container, all your changes are lost. To retain the changes, the most common way is to use a Docker volume mount to mount another directory into your container:

       docker run -v <host_src>:<container_directory_to_mount> -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.8 /bin/bash
    
  • Example that retains the Ivy cache jars for local Spark:

    For Linux:

       docker run -v ~/.ivy2:/home/here/.ivy2 -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.8 /bin/bash
    

    For Windows:

       docker run -v %USERPROFILE%\.ivy2:/home/here/.ivy2 -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.8 /bin/bash
    

Activate the conda environment:

source activate olp-sdk-for-python-1.8-env

Go to the home directory and start Jupyter:

cd ~/
jupyter notebook --NotebookApp.iopub_data_rate_limit=1000000000 --ip=0.0.0.0 --port=8080

JupyterLab

If you work with the JupyterLab "desktop" instead of the "classic" Jupyter notebooks, use this command to start Jupyter:

cd ~/
jupyter lab --NotebookApp.iopub_data_rate_limit=1000000000 --ip=0.0.0.0 --port=8080

With JupyterLab you will benefit from installing a few additional JupyterLab extensions. These will either render files in some frequently used formats (e.g. HTML or GeoJSON) or some computed output (like Leaflet map cells) directly inside JupyterLab:

jupyter labextension install @mflevine/jupyterlab_html
jupyter labextension install @jupyterlab/geojson-extension
jupyter labextension install jupyter-leaflet
jupyter labextension install @jupyter-widgets/jupyterlab-manager

You might also be able to install these inside JupyterLab using its interactive Extension Manager.

Docker with Spark

You can start the Livy server using this command:

~/livy/bin/livy-server start

Livy server runs by default on localhost:8998. You can stop it by running:

~/livy/bin/livy-server stop
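
To confirm that Livy is up, you can query its REST endpoint for the session list. The sketch below assumes the default address localhost:8998 mentioned above and the standard Livy GET /sessions response shape; the function names are hypothetical, and the sample payload is illustrative only.

```python
import json
from urllib.request import urlopen


def session_states(sessions_payload):
    """Map session id -> state from the JSON body of Livy's GET /sessions."""
    return {s["id"]: s["state"] for s in sessions_payload.get("sessions", [])}


def fetch_session_states(base_url="http://localhost:8998", timeout=5):
    """Query a running Livy server; raises URLError if it is not reachable."""
    with urlopen(f"{base_url.rstrip('/')}/sessions", timeout=timeout) as resp:
        return session_states(json.load(resp))


# Example payload in the shape Livy returns (values are illustrative only).
sample = {"from": 0, "total": 1, "sessions": [{"id": 0, "state": "idle", "kind": "pyspark"}]}
print(session_states(sample))
```

Calling fetch_session_states() inside the container after livy-server start should return an empty dictionary until a Spark session is created.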

Tutorial Notebooks

The tutorial notebooks for Spark are located in the folder: $HOME/olp-sdk-for-python-1.8/tutorial-notebooks/spark.

EMR Spark Cluster

Edit the emr.env file to provide your AWS and HERE platform repository credentials:

vi ~/.here/emr/emr.env

#!/usr/bin/env bash

# Credentials variables
export DEFAULT_AWS_ACCESS_KEY="your AWS access key"
export DEFAULT_AWS_ACCESS_KEY_SECRET="your AWS access key secret"
export DEFAULT_HERE_USER="your HERE maven repository user"
export DEFAULT_HERE_PASSWORD="your HERE maven repository password"

# Environment variables
export DEFAULT_EMR_CORES="2"
export DEFAULT_EMR_VERSION="emr-5.24.0"
export DEFAULT_EMR_MASTER_TYPE="m4.large"
export DEFAULT_EMR_WORKER_TYPE="m4.2xlarge"
export DEFAULT_TAG_TEAM="My Team"
export DEFAULT_TAG_PROJECT="My Project"
export DEFAULT_TAG_OWNER="Me"
export DEFAULT_TAG_ENV="PoC"
export DEFAULT_AWS_REGION="us-east-2"
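
A quick way to catch values that were left at their template text is to scan the file before provisioning. This is a minimal sketch under the assumption that every setting is written as export NAME="value", as in the listing above; the helper names parse_env and unfilled are hypothetical.

```python
import re

# Matches lines of the form: export NAME="value"
EXPORT_LINE = re.compile(r'^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)="(.*)"\s*$')


def parse_env(text):
    """Parse `export NAME="value"` lines; comments and blank lines are skipped."""
    values = {}
    for line in text.splitlines():
        match = EXPORT_LINE.match(line)
        if match:
            values[match.group(1)] = match.group(2)
    return values


def unfilled(values):
    """Names whose value is empty or still starts with the template's "your "."""
    return sorted(k for k, v in values.items() if not v or v.startswith("your "))


sample = '''
# Credentials variables
export DEFAULT_AWS_ACCESS_KEY="your AWS access key"
export DEFAULT_AWS_REGION="us-east-2"
'''
print(unfilled(parse_env(sample)))
```

To check the real file, read it with pathlib.Path.home() / ".here" / "emr" / "emr.env" and pass its text to parse_env.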

Provision the EMR cluster:

emr-provision -ns <custom-single-word>

Note

<custom-single-word> is a suffix added to AWS resource names to avoid collisions. It should contain alphanumeric characters and hyphens only.

  • Deprovision the cluster before exiting the Docker container to avoid being charged for unused infrastructure.
  • Once you exit the container, its state is lost, so you cannot deprovision the cluster after you exit and rerun. If that happens, delete the AWS resources using the AWS console.
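
The naming rule for <custom-single-word> can be checked before calling emr-provision. The sketch below encodes only the rule stated in the note (alphanumeric characters and hyphens); any further limits that AWS places on resource names, such as length, are not covered. The function name is hypothetical.

```python
import re

# Alphanumeric characters and hyphens only, per the note above.
NAMESPACE = re.compile(r"[A-Za-z0-9-]+")


def is_valid_namespace(suffix):
    """Return True if suffix is non-empty and uses only allowed characters."""
    return bool(NAMESPACE.fullmatch(suffix))


for candidate in ("lab-01", "my lab", "team_a"):
    print(candidate, is_valid_namespace(candidate))
```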

After successful provisioning, you should see a message similar to:

Apply complete! Resources: 20 added, 0 changed, 0 destroyed.

Outputs:

emr_master_public_dns = ec2-3-16-25-189.us-east-2.compute.amazonaws.com

Environment up and running, fully operational!

Access your Livy session list here:

>> http://ec2-3-16-25-189.us-east-2.compute.amazonaws.com:8998

Access the YARN Resource Manager here:

>> http://ec2-3-16-25-189.us-east-2.compute.amazonaws.com:8088

You can use this bucket to upload and process data

>> s3://spark-emrlab-bucket-lab

Within Jupyter, create a notebook, select one of the Python 3 kernels, and add the following cells:

Cell 1

%load_ext sparkmagic.magics

Cell 2

%%spark config
{
  "driverMemory": "2G",
  "executorMemory": "4G",
  "executorCores": 2,
  "conf": {
    "spark.scheduler.mode": "FAIR",
    "spark.executor.instances": 2,
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    "spark.dynamicAllocation.cachedExecutorIdleTimeout": "60s",
    "spark.dynamicAllocation.minExecutors": 1,
    "spark.dynamicAllocation.maxExecutors": 4,
    "spark.dynamicAllocation.initialExecutors": 1,
    "spark.jars.ivySettings": "/var/lib/spark/.here/ivy.settings.xml",
    "spark.driver.userClassPathFirst": "false",
    "spark.executor.userClassPathFirst": "false",
    "spark.jars.packages": "com.here.olp.util:mapquad:4.0.13,com.here.platform.location:location-compilation-core_2.11:0.11.156,com.here.platform.location:location-core_2.11:0.11.156,com.here.platform.location:location-inmemory_2.11:0.11.156,com.here.platform.location:location-integration-here-commons_2.11:0.11.156,com.here.platform.location:location-integration-optimized-map_2.11:0.11.156,com.here.platform.location:location-data-loader-standalone_2.11:0.11.156,com.here.platform.location:location-spark_2.11:0.11.156,com.here.platform.location:location-compilation-here-map-content_2.11:0.11.156,com.here.platform.location:location-examples-utils_2.11:0.4.115,com.here.schema.sdii:sdii_archive_v1_java:1.0.0-20171005-1,com.here.sdii:sdii_message_v3_java:3.3.2,com.here.schema.rib:lane-attributes_v2_scala:2.8.0,com.here.schema.rib:road-traffic-pattern-attributes_v2_scala:2.8.0,com.here.schema.rib:advanced-navigation-attributes_v2_scala:2.8.0,com.here.schema.rib:cartography_v2_scala:2.8.0,com.here.schema.rib:adas-attributes_v2_scala:2.8.0,com.typesafe.akka:akka-actor_2.11:2.5.11,com.beachape:enumeratum_2.11:1.5.13,com.github.ben-manes.caffeine:caffeine:2.6.2,com.github.cb372:scalacache-caffeine_2.11:0.24.3,com.github.cb372:scalacache-core_2.11:0.24.3,com.github.os72:protoc-jar:3.6.0,com.google.protobuf:protobuf-java:3.6.1,com.here.platform.data.client:blobstore-client_2.11:0.1.833,com.here.platform.data.client:spark-support_2.11:0.1.833,com.iheart:ficus_2.11:1.4.3,com.typesafe:config:1.3.3,org.apache.logging.log4j:log4j-api-scala_2.11:11.0,org.typelevel:cats-core_2.11:1.4.0,org.typelevel:cats-kernel_2.11:1.4.0,org.apache.logging.log4j:log4j-api:2.8.2,com.here.platform.data.client:data-client_2.11:0.1.833,com.here.platform.data.client:client-core_2.11:0.1.833,com.here.platform.data.client:hrn_2.11:0.1.614,com.here.platform.data.client:data-engine_2.11:0.1.833,com.here.platform.data.client:blobstore-client_2.11:0.1.833,com.here.account:here-oauth-client:0.4.14,com.here.platform.analytics:spark-ds-connector-deps_2.11:0.6.15,com.here.platform.analytics:spark-ds-connector_2.11:0.6.15",
    "spark.jars.excludes": "com.here.*:*_proto,org.json4s:*,org.apache.spark:spark-core_2.11,org.apache.spark:spark-sql_2.11,org.apache.spark:spark-streaming_2.11,org.apache.spark:spark-launcher_2.11,org.apache.spark:spark-network-shuffle_2.11,org.apache.spark:spark-unsafe_2.11,org.apache.spark:spark-network-common_2.11,org.apache.spark:spark-tags_2.11,org.scala-lang:scala-library,org.scala-lang:scala-compiler,org.scala-lang.modules:scala-parser-combinators_2.11,org.scala-lang.modules:scala-java8-compat_2.11,org.scala-lang:scala-reflect,org.scala-lang:scalap,com.fasterxml.jackson.core:jackson-*"
  }
}
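
Because spark.jars.packages is one long comma-separated string, repeated or conflicting coordinates are easy to miss when you edit it by hand. The following sketch is one possible way to check such a string; the helper names are hypothetical, and the sample string is a shortened illustration rather than the full list above.

```python
from collections import Counter


def duplicate_packages(packages):
    """Return coordinates listed more than once in a comma-separated string."""
    coords = [c.strip() for c in packages.split(",") if c.strip()]
    return sorted(c for c, n in Counter(coords).items() if n > 1)


def conflicting_versions(packages):
    """Return group:artifact keys that appear with more than one version."""
    versions = {}
    for coord in {c.strip() for c in packages.split(",") if c.strip()}:
        # Split off the last field, which is the version.
        key, _, version = coord.rpartition(":")
        versions.setdefault(key, set()).add(version)
    return sorted(k for k, v in versions.items() if len(v) > 1)


sample = "com.typesafe:config:1.3.3,org.typelevel:cats-core_2.11:1.4.0,com.typesafe:config:1.3.3"
print(duplicate_packages(sample))
print(conflicting_versions(sample))
```

An exact duplicate is harmless to dependency resolution but adds noise; two versions of the same artifact are more likely to deserve a closer look.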

Cell 3

# For a Scala Spark session
%spark add -s scala-spark -l scala -u <PUT YOUR LIVY ENDPOINT HERE> -k

# For a Pyspark Session
%spark add -s pyspark -l python -u <PUT YOUR LIVY ENDPOINT HERE> -k
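
The Livy endpoint for the commands above is the master DNS name printed by emr-provision, on port 8998. As a sketch, the helper below pulls that name out of captured provisioning output; the function name livy_endpoint is hypothetical, and it assumes the emr_master_public_dns line has the format shown in the sample output earlier.

```python
import re

# Matches the "emr_master_public_dns = <host>" line in the provisioning output.
DNS_LINE = re.compile(r"^\s*emr_master_public_dns\s*=\s*(\S+)\s*$", re.MULTILINE)


def livy_endpoint(provision_output, port=8998):
    """Build the Livy URL from the text printed by emr-provision."""
    match = DNS_LINE.search(provision_output)
    if match is None:
        raise ValueError("emr_master_public_dns not found in output")
    return f"http://{match.group(1)}:{port}"


sample = """Outputs:

emr_master_public_dns = ec2-3-16-25-189.us-east-2.compute.amazonaws.com
"""
print(livy_endpoint(sample))
```

Passing port=8088 gives the YARN Resource Manager address in the same way.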

Note

On EMR, it is necessary to explicitly provide the credentials to read HERE platform data in the notebook. You will need your HERE platform App ID and KeySecret to submit your job.

Use Your Credentials

Scala

%%spark
val accessKeyId = "<Your Access Key ID>"
val accessKeySecret = "<Your Access Key Secret>"
val layerHRN = "<Some Layer HRN>"

val df = spark.read.option("partitions", 900)
            .option("parallelism", 4)
            .option("accesskeyid", accessKeyId)
            .option("accesskeysecret", accessKeySecret)
            .ds(layerHRN)

PySpark

%%spark
accessKeyId = "<Your Access Key ID>"
accessKeySecret = "<Your Access Key Secret>"
layerHRN = "<Some Layer HRN>"

# Parentheses let the chained calls span several lines in Python.
df = (
    spark.read.format("com.here.platform.analytics.ds")
    .option("partitions", 900)
    .option("parallelism", 4)
    .option("accesskeyid", accessKeyId)
    .option("accesskeysecret", accessKeySecret)
    .option("layerhrn", layerHRN)
    .load()
)
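
To avoid typing the key pair into a notebook, you could read it from a properties file in a local Python cell. This is only a sketch: the property names here.access.key.id and here.access.key.secret are assumptions about the layout of credentials.properties, the sample values are placeholders, and how the values reach the remote Spark session depends on your setup.

```python
def read_properties(text):
    """Parse simple `key = value` lines; blank lines and #/! comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


# Hypothetical file content; the property names are assumptions.
sample = """
# HERE platform credentials
here.access.key.id = example-id
here.access.key.secret = example-secret
"""
props = read_properties(sample)
accessKeyId = props["here.access.key.id"]
accessKeySecret = props["here.access.key.secret"]
print(accessKeyId)
```

For a real file, replace the sample string with the text of the file, for example pathlib.Path.home().joinpath(".here", "credentials.properties").read_text().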

Start coding your job!

After finishing your job, destroy the cluster to prevent getting charged for unused infrastructure.

emr-deprovision

Deep Debugging

By default, internet access is restricted to the Livy and YARN Resource Manager endpoints. To explore the cluster logs and access the internal nodes, you need to open an SSH tunnel and connect. When you deploy a new cluster, a script is created for you to open the SSH tunnel:

$ cd ~/.here/emr
$ ./emr-tunnel.sh

Next, install FoxyProxy in your web browser.

Then, depending on your web browser, load the FoxyProxy configuration provided at these file paths:

  • For Chrome: ~/anaconda3/envs/<your_env>/lib/olp-emr/util/foxy-proxy-chrome.xml
  • For Firefox: ~/anaconda3/envs/<your_env>/lib/olp-emr/util/foxy-proxy-firefox.json

Finally, activate FoxyProxy for all URLs or based on patterns (see FoxyProxy for instructions). You can now access internal machine endpoints from your web browser.

Tutorial Notebooks

The tutorial notebooks for EMR are located in the folder:

$HOME/olp-sdk-for-python-1.8/tutorial-notebooks/emr.


To help us improve the setup experience, please fill out the short one-minute survey after you finish setting up the SDK.

