Apache Spark has become the de facto standard for processing large amounts of data, both at rest and in motion, in a distributed manner. The addition of the MLlib library, which consists of common learning algorithms and utilities, opened Spark up to a wide range of machine learning tasks and paved the way for running complex machine learning workflows on top of Apache Spark clusters. The main benefits of using Spark’s machine learning capabilities are:

  • Distributed training – Distribute computationally heavy workloads such as model training or hyperparameter tuning across the cluster
  • Interactive exploratory analysis – Load large data sets efficiently in a distributed fashion, then explore and understand them through the familiar Spark SQL interface
  • Data ingestion and transformation – Ingest, aggregate, and transform large data sets

At the same time, Spark brings its own set of challenges for data scientists:

  • Complexity – Apache Spark uses a layered architecture comprising a master node, a cluster manager, and a set of worker nodes. Quite often, Spark is not deployed in isolation but sits on top of a virtualized infrastructure (e.g., virtual machines or operating-system-level virtualization). Maintaining the configuration of the cluster and its underlying infrastructure can be a complex and time-consuming task
  • Lack of GPU acceleration – Complex machine learning workloads, especially those involving deep learning, benefit from GPU architectures, which are well suited to vector and matrix operations. The thread- and CPU-centric parallelism offered by Spark is generally no match for the large, fast register files and optimized memory bandwidth of GPU architectures.
  • Cost – Keeping a Spark cluster up and running for intermittent use can quickly become expensive (especially if Spark runs in the cloud). Quite often, Spark is needed for only a fraction of the ML pipeline (e.g., data preprocessing) because the result set it produces fits comfortably into a cuDF DataFrame

To address the complexity and cost challenges, Domino offers the ability to dynamically provision and orchestrate a Spark cluster directly on the infrastructure backing the Domino instance. This gives Domino users quick access to Spark without having to rely on their IT team to create and manage a cluster for them. Spark workloads run fully containerized on Domino’s Kubernetes cluster, and users can access Spark interactively through a Domino workspace (e.g., JupyterLab) or in batch mode via a Domino job or spark-submit. Moreover, because Domino can provision and tear down clusters automatically, users can spin up Spark clusters on demand, use them as part of a complex pipeline, and tear them down once the stage that needed them is complete.

To address the need for GPU-accelerated Spark, Domino is partnering with NVIDIA. The Domino platform has been able to take advantage of GPU-accelerated hardware (both in the cloud and on-premises) for some time, and its underlying Kubernetes architecture naturally allows it to deploy and use NGC containers out of the box. This, for example, gives data scientists straightforward access to NVIDIA RAPIDS – a suite of software libraries built on CUDA-X AI – and the freedom to run end-to-end data science and analytics pipelines entirely on GPUs. In addition, Domino supports the RAPIDS Accelerator for Apache Spark, which combines the power of the RAPIDS cuDF library with the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also includes a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities. With these capabilities, Domino provides streamlined access to GPU-accelerated ML/DL frameworks and GPU-accelerated Apache Spark components through a unified, data-scientist-friendly interface.

GPU-accelerated architecture: Spark components with the NVIDIA SQL/DataFrame plugin and accelerated ML/DL frameworks on top of the Spark 3.0 core

Configuring Spark Clusters with RAPIDS Accelerator in Domino

By default, Domino does not provide a Spark-compatible compute environment (Docker image), so our first task is to create one. Creating a new compute environment is a well-documented process, so refer to the official documentation if you need a refresher.

The key steps are to give the new environment a name (e.g. Spark 3.0.0 GPU) and to use bitnami/spark:2.4.6 as the base image. Domino’s on-demand Spark functionality has been developed and tested against Bitnami’s open-source Spark images (here is the reasoning behind that choice, if you are interested). You could also use the bitnami/spark:3.0.0 image, but since we are swapping out the Spark installation anyway, it doesn’t really matter.

Screenshot of the Domino New Environment interface

Next, we need to modify the Dockerfile of the compute environment to bring Spark up to version 3.0.0 and to add the NVIDIA CUDA libraries, the RAPIDS Accelerator, and a GPU discovery script. Add the code below to the Dockerfile Instructions section and start a rebuild of the compute environment.

# SPARK 3.0.0 GPU ENVIRONMENT DOCKERFILE
USER root

#
#  SPARK AND HADOOP
#

RUN apt-get update && apt-get install -y wget && rm -r /var/lib/apt/lists /var/cache/apt/archives

ENV HADOOP_VERSION=3.2.1
ENV HADOOP_HOME=/opt/hadoop
ENV HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
ENV SPARK_VERSION=3.0.0
ENV SPARK_HOME=/opt/bitnami/spark

### Remove the pre-installed Spark since it is pre-bundled with hadoop but preserve the python env
WORKDIR /opt/bitnami
RUN rm -rf ${SPARK_HOME}

### Install the desired Hadoop-free Spark distribution
RUN wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    tar -xf spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    rm spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    mv spark-${SPARK_VERSION}-bin-without-hadoop ${SPARK_HOME} && \
    chmod -R 777 ${SPARK_HOME}/conf

### Install the desired Hadoop libraries
RUN wget -q http://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz && \
    tar -xf hadoop-${HADOOP_VERSION}.tar.gz && \
    rm hadoop-${HADOOP_VERSION}.tar.gz && \
    mv hadoop-${HADOOP_VERSION} ${HADOOP_HOME}

### Setup the Hadoop libraries classpath
RUN echo 'export SPARK_DIST_CLASSPATH="$(hadoop classpath):${HADOOP_HOME}/share/hadoop/tools/lib/*:/opt/sparkRapidsPlugin"' >> ${SPARK_HOME}/conf/spark-env.sh
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$HADOOP_HOME/lib/native"

### This is important to maintain compatibility with Bitnami
WORKDIR /
RUN /opt/bitnami/scripts/spark/postunpack.sh
WORKDIR ${SPARK_HOME}

#
# NVIDIA CUDA
#

RUN apt-get update && apt-get install -y --no-install-recommends \
        gnupg2 curl ca-certificates && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
    echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
    echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
    apt-get purge --autoremove -y curl && \
    rm -rf /var/lib/apt/lists/*

ENV CUDA_VERSION 10.1.243

ENV CUDA_PKG_VERSION 10-1=$CUDA_VERSION-1

# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-cudart-$CUDA_PKG_VERSION \
        cuda-compat-10-1 && \
    ln -s cuda-10.1 /usr/local/cuda && \
    rm -rf /var/lib/apt/lists/*

# Required for nvidia-docker v1
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64

# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411"

ENV NCCL_VERSION 2.4.8

RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-libraries-$CUDA_PKG_VERSION \
        cuda-nvtx-$CUDA_PKG_VERSION \
        libcublas10=10.2.1.243-1 \
        libnccl2=$NCCL_VERSION-1+cuda10.1 && \
    apt-mark hold libnccl2 && \
    rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-nvml-dev-$CUDA_PKG_VERSION \
        cuda-command-line-tools-$CUDA_PKG_VERSION \
        cuda-libraries-dev-$CUDA_PKG_VERSION \
        cuda-minimal-build-$CUDA_PKG_VERSION \
        libnccl-dev=$NCCL_VERSION-1+cuda10.1 \
        libcublas-dev=10.2.1.243-1 && \
    rm -rf /var/lib/apt/lists/*

ENV LIBRARY_PATH /usr/local/cuda/lib64/stubs

ENV CUDNN_VERSION 7.6.5.32
LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"

RUN apt-get update && apt-get install -y --no-install-recommends \
        libcudnn7=$CUDNN_VERSION-1+cuda10.1 \
        libcudnn7-dev=$CUDNN_VERSION-1+cuda10.1 && \
    apt-mark hold libcudnn7 && \
    rm -rf /var/lib/apt/lists/*

#
# GPU DISCOVERY SCRIPT
#
ENV SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin
RUN wget -q -P $SPARK_RAPIDS_DIR https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
RUN chmod +x $SPARK_RAPIDS_DIR/getGpusResources.sh
RUN echo 'export SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=1 -Dspark.worker.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh"' >> ${SPARK_HOME}/conf/spark-env.sh

ENV PATH="$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin"
WORKDIR ${SPARK_HOME}

RUN wget -q -P $SPARK_RAPIDS_DIR https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.1.0/rapids-4-spark_2.12-0.1.0.jar
RUN wget -q -P $SPARK_RAPIDS_DIR https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-1.jar
ENV SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.14-cuda10-1.jar
ENV SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.1.0.jar

We can verify that the new environment has been built successfully by checking the Versions section and making sure that the active version is the latest one.

The Domino Versions tab showing a successfully built version of the GPU compute environment.

Now that we have a Spark environment with the RAPIDS Accelerator in place, we need to create a workspace environment – the environment that hosts the IDE we will use to interact with Spark.

Creating a custom PySpark workspace environment is fully covered in Domino’s official documentation. The process is similar to building the Spark environment described above, the main differences being that we use a Domino base image (instead of a Bitnami one) and that we also need to define pluggable workspace tools. The latter provide access to web-based tools inside the compute environment (e.g. JupyterLab).

To build the workspace environment, we create a new compute environment (e.g. Spark 3.0.0 RAPIDS Workspace Py3.6) using dominodatalab/base:Ubuntu18_DAD_Py3.6_R3.6_20200508 as the base image and add the following to its Dockerfile Instructions section:

# SPARK 3.0.0 RAPIDS WORKSPACE DOCKERFILE

RUN mkdir -p /opt/domino

### Modify the Hadoop and Spark versions below as needed.
ENV HADOOP_VERSION=3.2.1
ENV HADOOP_HOME=/opt/domino/hadoop
ENV HADOOP_CONF_DIR=/opt/domino/hadoop/etc/hadoop
ENV SPARK_VERSION=3.0.0
ENV SPARK_HOME=/opt/domino/spark
ENV PATH="$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin"

### Install the desired Hadoop-free Spark distribution
RUN rm -rf ${SPARK_HOME} && \
    wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    tar -xf spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    rm spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    mv spark-${SPARK_VERSION}-bin-without-hadoop ${SPARK_HOME} && \
    chmod -R 777 ${SPARK_HOME}/conf

### Install the desired Hadoop libraries
RUN rm -rf ${HADOOP_HOME} && \
    wget -q http://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz && \
    tar -xf hadoop-${HADOOP_VERSION}.tar.gz && \
    rm hadoop-${HADOOP_VERSION}.tar.gz && \
    mv hadoop-${HADOOP_VERSION} ${HADOOP_HOME}

### Setup the Hadoop libraries classpath and Spark related envars for proper init in Domino
RUN echo "export SPARK_HOME=${SPARK_HOME}" >> /home/ubuntu/.domino-defaults
RUN echo "export HADOOP_HOME=${HADOOP_HOME}" >> /home/ubuntu/.domino-defaults
RUN echo "export HADOOP_CONF_DIR=${HADOOP_CONF_DIR}" >> /home/ubuntu/.domino-defaults
RUN echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native" >> /home/ubuntu/.domino-defaults
RUN echo "export PATH=\$PATH:${SPARK_HOME}/bin:${HADOOP_HOME}/bin" >> /home/ubuntu/.domino-defaults
RUN echo 'export SPARK_DIST_CLASSPATH="$(hadoop classpath):${HADOOP_HOME}/share/hadoop/tools/lib/*"' >> ${SPARK_HOME}/conf/spark-env.sh

### Complete the PySpark setup from the Spark distribution files
WORKDIR $SPARK_HOME/python
RUN python setup.py install

### Optionally copy spark-submit to spark-submit.sh to be able to run from Domino jobs
RUN spark_submit_path=$(which spark-submit) && \
    cp ${spark_submit_path} ${spark_submit_path}.sh
    
ENV SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin
RUN wget -q -P $SPARK_RAPIDS_DIR https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.1.0/rapids-4-spark_2.12-0.1.0.jar
RUN wget -q -P $SPARK_RAPIDS_DIR https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-1.jar
ENV SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.14-cuda10-1.jar
ENV SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.1.0.jar

Note that we also add the RAPIDS Accelerator jars at the end and set several environment variables to make the plugin readily available in the hosted IDE (e.g. JupyterLab). We also add the following mapping to the Pluggable Workspace Tools section to make Jupyter and JupyterLab available through the Domino interface.

jupyter:
  title: "Jupyter (Python, R, Julia)"
  iconUrl: "/assets/images/workspace-logos/Jupyter.svg"
  start: [ "/var/opt/workspaces/jupyter/start" ]
  httpProxy:
    port: 8888
    rewrite: false
    internalPath: "/{{ownerUsername}}/{{projectName}}/{{sessionPathComponent}}/{{runId}}/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
    requireSubdomain: false
  supportedFileExtensions: [ ".ipynb" ]
jupyterlab:
  title: "JupyterLab"
  iconUrl: "/assets/images/workspace-logos/jupyterlab.svg"
  start: [  /var/opt/workspaces/Jupyterlab/start.sh ]
  httpProxy:
    internalPath: "/{{ownerUsername}}/{{projectName}}/{{sessionPathComponent}}/{{runId}}/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
    port: 8888
    rewrite: false
    requireSubdomain: false
vscode:
 title: "vscode"
 iconUrl: "/assets/images/workspace-logos/vscode.svg"
 start: [ "/var/opt/workspaces/vscode/start" ]
 httpProxy:
    port: 8888
    requireSubdomain: false
rstudio:
  title: "RStudio"
  iconUrl: "/assets/images/workspace-logos/Rstudio.svg"
  start: [ "/var/opt/workspaces/rstudio/start" ]
  httpProxy:
    port: 8888
    requireSubdomain: false

Once the workspace and Spark environments are available, everything is in place to launch GPU-accelerated Spark clusters. At this point, we simply go to an arbitrary project and define a new workspace. We can name the workspace On-Demand Spark, select the Spark 3.0.0 RAPIDS Workspace Py3.6 environment, and choose JupyterLab as the IDE. The hardware tier selected for the workspace can be relatively small, because the Spark cluster will do most of the heavy lifting.

The Launch New Workspace screen in Domino. The environment is set to the Spark 3.0.0 RAPIDS workspace and the IDE is JupyterLab.

On the Compute Cluster screen, we select Spark, set the number of executors we want Domino to provision for the cluster, and choose the hardware tiers for the Spark executors and the Spark master. We need to make sure that these hardware tiers provide NVIDIA GPUs if we want to benefit from the RAPIDS Accelerator.

The Compute Cluster section of the Launch New Workspace dialog, showing 2 executors, a GPU hardware tier for the executors, and a GPU hardware tier for the Spark master. The cluster compute environment is set to Spark 3.0.0 GPU.

Once the cluster is provisioned and running, we are presented with a JupyterLab instance. There is also an additional tab in the workspace – Spark Web UI – which provides access to the web interface of the running Spark application and allows us to monitor and inspect the corresponding job executions.

We can now create a notebook with a minimal example to test the configuration. First, we connect to the on-demand cluster and create the Spark application:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .config("spark.task.cpus", 1) \
                    .config("spark.driver.extraClassPath", "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar:/opt/sparkRapidsPlugin/cudf-0.14-cuda10-1.jar") \
                    .config("spark.executor.extraClassPath", "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar:/opt/sparkRapidsPlugin/cudf-0.14-cuda10-1.jar") \
                    .config("spark.executor.resource.gpu.amount", 1) \
                    .config("spark.executor.cores", 6) \
                    .config("spark.task.resource.gpu.amount", 0.15) \
                    .config("spark.rapids.sql.concurrentGpuTasks", 1) \
                    .config("spark.rapids.memory.pinnedPool.size", "2G") \
                    .config("spark.locality.wait", "0s") \
                    .config("spark.sql.files.maxPartitionBytes", "512m") \
                    .config("spark.sql.shuffle.partitions", 10) \
                    .config("spark.plugins", "com.nvidia.spark.SQLPlugin") \
                    .appName("MyGPUAppName") \
                    .getOrCreate()

Note that part of this configuration is dynamic, as it depends on the GPU hardware tier in use (a short illustrative sketch follows the list below):

  • spark.task.cpus – the number of CPU cores to allocate to each task
  • spark.task.resource.gpu.amount – the amount of GPU resources per task. Note that this can be a fraction and should be set according to the number of cores per executor. In this test we set it to 0.15, which is just under 1/6 (six executor cores share a single GPU)
  • spark.executor.resource.gpu.amount – the number of GPUs available per executor on the selected hardware tier (a single V100 here)
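
Purely as an illustration (this sketch is not part of the original configuration), the relationship between these settings can be checked with a few lines of Python, assuming the same executor shape as above – one GPU and six cores per executor:

# Assumed executor shape, matching the builder configuration above.
executor_gpus = 1      # spark.executor.resource.gpu.amount
executor_cores = 6     # spark.executor.cores

# Each of the 6 concurrent tasks may request at most 1/6 of the GPU.
max_gpu_share_per_task = executor_gpus / executor_cores   # ~0.1667
task_gpu_amount = 0.15                                     # value used above, safely below the ceiling

assert task_gpu_amount <= max_gpu_share_per_task
print(f"{executor_cores} concurrent tasks can share {executor_gpus} GPU "
      f"({task_gpu_amount} GPU requested per task, ceiling {max_gpu_share_per_task:.4f})")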

Once the application is created and connected to the cluster, it appears in the Spark Web UI section of the workspace:

The Spark Web UI tab showing 2 workers and the MyGPUAppName application with 12 cores and 1 GPU per executor.
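
We can also confirm this from the notebook itself. The following snippet (not part of the original walkthrough) prints the Spark version and the internal address of the application UI:

# Quick programmatic check that the application is up and running Spark 3.0.0.
print(spark.version)                # expected: 3.0.0
print(spark.sparkContext.uiWebUrl)  # internal URL backing the Spark Web UI tab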

We can then run a simple outer join, which looks like this:

df1 = spark.sparkContext.parallelize(range(1, 100)).map(lambda x: (x, "a" * x)).toDF()
df2 = spark.sparkContext.parallelize(range(1, 100)).map(lambda x: (x, "b" * x)).toDF()
df = df1.join(df2, how="outer")
df.count()

Once the count() operation completes, we can inspect the DAG of the first job (for example) and clearly see that Spark is using GPU-accelerated operations (e.g. GpuColumnarExchange, GpuHashAggregate, etc.)

Spark DAG visualization showing two stages in which the standard operators have been replaced by GPU-accelerated equivalents (e.g. GpuHashAggregate, GpuColumnarExchange, etc.)
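
As an additional sanity check, we can also print the physical plan directly from the notebook; assuming the RAPIDS Accelerator is active, the operators in the plan should carry the Gpu prefix:

# Print the physical plan for the joined DataFrame. With the SQLPlugin enabled,
# GPU-accelerated operators (e.g. GpuColumnarExchange, GpuHashAggregate) should
# appear in place of their CPU counterparts.
df.explain()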

Summary

In this post, we showed that configuring an on-demand Apache Spark cluster with the RAPIDS Accelerator and GPU-accelerated backing hardware is a fairly simple process in Domino. Besides removing the need to deal with the underlying infrastructure, lowering the cost of on-demand use, and providing the out-of-the-box reproducibility offered by the Domino platform, this setup also significantly reduces processing times, making data science teams more efficient and enabling them to achieve higher model velocity.

Benchmark showing 3.8x acceleration and 50% cost savings in ETL workloads.

A benchmark published by NVIDIA shows a 3.8x speed-up and 50% cost savings for an ETL workload run on the Fannie Mae mortgage data set (~200GB) using V100 GPU instances.

For more information, see Domino’s official documentation and the documentation for the RAPIDS Accelerator for Apache Spark.
