
JupyterHub with Spark

JupyterHub is ideal for enabling multiple users to easily start predefined workspaces in the same project. The complementary Apache Spark cluster can be used from the workspaces to perform distributed processing.

🧊 Install kfctl

You will need to have the usual oc tool installed, as well as kfctl, a tool to deploy Kubeflow applications, on your machine. Download the latest version for your OS 📥️

You can then install it by downloading the binary and putting it in your path, for example on Linux:

wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
tar -xzf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
sudo mv kfctl /usr/local/bin/
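
If the download and installation went fine, kfctl should now be available in your path; you can quickly check it with:

kfctl version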

Clone the repository with the DSRI custom images and deployments for the OpenDataHub platform, and go to the kfdef folder:

git clone https://github.com/MaastrichtU-IDS/odh-manifests
cd odh-manifests/kfdef

🪐 Deploy JupyterHub and Spark

Go to the kfdef folder

All scripts need to be run from the kfdef folder 📂

You can deploy JupyterHub with 2 different authentication systems; use the file corresponding to your choice:

  • For the default DSRI authentication use kfctl_openshift_dsri.yaml

  • For GitHub authentication use kfctl_openshift_github.yaml

    • You need to create a new GitHub OAuth app: https://github.com/settings/developers

    • And provide the GitHub client ID and secret through environment variables before running the start script:

      export GITHUB_CLIENT_ID=YOUR_CLIENT_ID
      export GITHUB_CLIENT_SECRET=YOUR_CLIENT_SECRET

First you will need to change the namespace: in the file you want to deploy to the project where you want to start JupyterHub (currently opendatahub-ids). You can then deploy JupyterHub and Spark with kfctl:

./start_odh.sh kfctl_openshift_dsri.yaml
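
The deployment can take a few minutes. As a quick sanity check, you can watch the pods being created in your project with the usual oc commands, here assuming the opendatahub-ids project mentioned above:

# Watch the JupyterHub and Spark pods being created (Ctrl+C to stop)
oc get pods -n opendatahub-ids -w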

🗄️ Persistent volumes are automatically created for each instance started in JupyterHub to ensure persistence of the data even when JupyterHub is stopped. You can find the persistent volumes in the DSRI web UI: go to the Administrator view > Storage > Persistent Volume Claims.

⚡️ A Spark cluster with 3 workers is automatically created with the service name spark-cluster. You can use the URL of the master node to access it from your workspace: spark://spark-cluster:7077
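
You can also check from the terminal that the storage and the Spark cluster service have been created (assuming you are in the project where JupyterHub was deployed):

# Persistent volume claims created for the JupyterHub instances
oc get pvc
# Service exposing the Spark master on port 7077
oc get svc spark-cluster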

✨ Use the Spark cluster

Matching Spark versions

Make sure all the Spark versions match; the current default version is 3.0.1.

You can test the Spark cluster connection with PySpark:

from pyspark.sql import SparkSession
import socket

# Create a Spark session connected to the cluster
spark_cluster_url = "spark://spark-cluster:7077"
spark = SparkSession.builder.master(spark_cluster_url).getOrCreate()
sc = spark.sparkContext

# Test your Spark connection: list the hostnames of the workers used
spark.range(5, numPartitions=5).rdd.map(lambda x: socket.gethostname()).distinct().collect()

# Or try a simple parallelize/collect:
x = [1, 2, 3, 4, 5]
y = sc.parallelize(x)
y.collect()

# Or try a distributed reduce:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
distData.reduce(lambda a, b: a + b)

Match the version

Make sure all the Spark versions match; the current default version is 3.0.1. You can verify this as follows (see also the quick check after this list):

  • Go to the Spark UI to verify the version of the Spark cluster
  • Run spark-shell --version to verify the version of the Spark binary installed in the workspace
  • Run pip list | grep pyspark to verify the version of the PySpark library
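
For a quick check from a terminal inside the workspace (assuming python and spark-shell are on the PATH of your environment):

# Version of the Spark binary installed in the workspace
spark-shell --version
# Version of the PySpark library
python -c "import pyspark; print(pyspark.__version__)"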

Check the JupyterLab workspace Dockerfile to change the version of Spark installed in the workspace, and see how you can download and install a new version of the Spark binary.
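
As an illustration, installing a different Spark binary in a workspace usually comes down to downloading and extracting the release, then pointing SPARK_HOME to it. A minimal sketch, assuming Spark 3.0.1 built for Hadoop 2.7 (adapt the version and paths to your setup):

wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
tar -xzf spark-3.0.1-bin-hadoop2.7.tgz
# Point SPARK_HOME to the extracted folder and add it to the PATH
export SPARK_HOME=$(pwd)/spark-3.0.1-bin-hadoop2.7
export PATH="$SPARK_HOME/bin:$PATH"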

If you need to change the Python, Java or PySpark version in the workspace you can create an environment.yml file, for example for PySpark 2.4.5:

name: spark
channels:
  - defaults
  - conda-forge
  - anaconda
dependencies:
  - python=3.7
  - openjdk=8
  - ipykernel
  - nb_conda_kernels
  - pip
  - pip:
      - pyspark==2.4.5

Create the environment with mamba (or conda):

mamba env create -f environment.yml
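
You can then activate the environment; with nb_conda_kernels installed, the new kernel should also appear in JupyterLab. A quick check of the PySpark version, assuming the environment is named spark as in the example above:

conda activate spark
python -c "import pyspark; print(pyspark.__version__)"
# Should print 2.4.5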

Spark UI

You can also create a route to access the Spark UI and monitor the activity on the Spark cluster:

oc expose svc/spark-cluster-ui

Get the Spark UI URL:

oc get route --selector radanalytics.io/service=ui --no-headers -o=custom-columns=HOST:.spec.host

New Spark cluster

You can create a new Spark cluster with the installed Spark Operator, for example here using Spark 3.0.1:

cat <<EOF | oc apply -f -
apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: spark-cluster
spec:
  customImage: quay.io/radanalyticsio/openshift-spark:3.0.1-2
  worker:
    instances: '10'
    memory: "4Gi"
    cpu: 4
  master:
    instances: '1'
    memory: "4Gi"
    cpu: 4
  env:
    - name: SPARK_WORKER_CORES
      value: 4
EOF

You can browse the list of available versions of the quay.io/radanalyticsio/openshift-spark image on quay.io.

See the Radanalytics Spark operator example configuration for more details on the Spark cluster configuration.

🗑️ Delete the deployment

Delete the running JupyterHub application and Spark cluster, including persistent volumes:

./delete_odh.sh kfctl_openshift_dsri.yaml
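
You can check that the pods and persistent volumes have been removed (assuming you are still in the project where JupyterHub was deployed):

oc get pods
oc get pvc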