JupyterHub with Spark
JupyterHub makes it easy for multiple users to start predefined workspaces in the same project. The complementary Apache Spark cluster deployed alongside it can be used from the workspaces to perform distributed processing.
🧊 Install kfctl
You will need to have the usual oc tool installed, as well as kfctl, a tool to deploy Kubeflow applications. Download the latest version for your OS 📥️, then install it by extracting the binary and putting it in your path, for example on Linux:
wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
tar -xzf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
sudo mv kfctl /usr/local/bin/
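You can then check that the binary is found in your path, for example by printing its version:
kfctl version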
Clone the repository with the DSRI custom images and deployments for the OpenDataHub platform, and go to the kfdef folder:
git clone https://github.com/MaastrichtU-IDS/odh-manifests
cd odh-manifests/kfdef
🪐 Deploy JupyterHub and Spark
All scripts need to be run from the kfdef folder 📂
You can deploy JupyterHub with two different authentication systems; use the file corresponding to your choice:
- For the default DSRI authentication, use kfctl_openshift_dsri.yaml
- For GitHub authentication, use kfctl_openshift_github.yaml
For GitHub authentication, you need to create a new GitHub OAuth app at https://github.com/settings/developers, and provide the GitHub client ID and secret through environment variables before running the start script:
export GITHUB_CLIENT_ID=YOUR_CLIENT_ID
export GITHUB_CLIENT_SECRET=YOUR_CLIENT_SECRET
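When you run the start script in the next step, pass the GitHub variant of the file so that these variables are picked up, for example:
./start_odh.sh kfctl_openshift_github.yaml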
First you will need to change the namespace: field in the file you want to deploy, to the project where you want to start JupyterHub (currently opendatahub-ids).
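You can edit the file manually, or change it from the command line; a quick sketch, where my-project is a placeholder for your own project name (check the exact namespace: line in the file first):
sed -i 's/namespace: opendatahub-ids/namespace: my-project/' kfctl_openshift_dsri.yaml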
You can then deploy JupyterHub and Spark with kfctl:
./start_odh.sh kfctl_openshift_dsri.yaml
🗄️ Persistent volumes are automatically created for each instance started in JupyterHub to ensure the data persists even when JupyterHub is stopped. You can find the persistent volumes in the DSRI web UI: go to the Administrator view > Storage > Persistent Volume Claims.
⚡️ A Spark cluster with 3 workers is automatically created with the service name spark-cluster. You can use the URL of the master node to access it from your workspace: spark://spark-cluster:7077
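Before connecting, you can check from a terminal that the Spark pods are up, for example by filtering the pod list on the service name (pod names may vary):
oc get pods | grep spark-cluster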
✨ Use the Spark cluster
Make sure all the Spark versions match; the current default version is 3.0.1
You can test the Spark cluster connection with PySpark:
from pyspark.sql import SparkSession, SQLContext
import os
import socket
# Create a Spark session
spark_cluster_url = "spark://spark-cluster:7077"
spark = SparkSession.builder.master(spark_cluster_url).getOrCreate()
sc = spark.sparkContext
# Test your Spark connection
spark.range(5, numPartitions=5).rdd.map(lambda x: socket.gethostname()).distinct().collect()
# Or try:
#x = ['spark', 'rdd', 'example', 'sample', 'example']
x = [1, 2, 3, 4, 5]
y = sc.parallelize(x)
y.collect()
# Or try:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
distData.reduce(lambda a, b: a + b)
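Apart from testing interactively in a notebook, you could also submit a standalone script to the same master URL from a terminal in the workspace; a minimal sketch, where my_script.py is a placeholder for your own PySpark script:
spark-submit --master spark://spark-cluster:7077 my_script.py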
Match the version
Make sure all the Spark versions match; the current default version is 3.0.1:
- Go to the Spark UI to verify the version of the Spark cluster
- Run spark-shell --version to verify the version of the Spark binary installed in the workspace
- Run pip list | grep pyspark to verify the version of the PySpark library
Check the JupyterLab workspace Dockerfile to change the version of Spark installed in the workspace, and to see how you can download and install a new version of the Spark binary.
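As a rough idea of what such an installation step looks like (the Spark version, Hadoop build and download mirror below are examples, not necessarily what the DSRI image uses):
# Download and unpack a specific Spark release, then add it to the path
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
tar -xzf spark-3.0.1-bin-hadoop3.2.tgz
export SPARK_HOME=$PWD/spark-3.0.1-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$PATH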
If you need to change the Python, Java or PySpark version in the workspace, you can create an environment.yml file, for example for PySpark 2.4.5:
name: spark
channels:
  - defaults
  - conda-forge
  - anaconda
dependencies:
  - python=3.7
  - openjdk=8
  - ipykernel
  - nb_conda_kernels
  - pip
  - pip:
    - pyspark==2.4.5
Create the environment with mamba (conda works too):
mamba env create -f environment.yml
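You can then activate the environment and check that it provides the expected PySpark version (the environment name spark comes from the file above):
conda activate spark
pip list | grep pyspark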
Spark UI
You can also create a route to access the Spark UI and monitor the activity on the Spark cluster:
oc expose svc/spark-cluster-ui
Get the Spark UI URL:
oc get route --selector radanalytics.io/service=ui --no-headers -o=custom-columns=HOST:.spec.host
New Spark cluster
You can create a new Spark cluster with the installed Spark Operator, for example here using Spark 3.0.1:
cat <<EOF | oc apply -f -
apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: spark-cluster
spec:
  customImage: quay.io/radanalyticsio/openshift-spark:3.0.1-2
  worker:
    instances: '10'
    memory: "4Gi"
    cpu: 4
  master:
    instances: '1'
    memory: "4Gi"
    cpu: 4
  env:
    - name: SPARK_WORKER_CORES
      value: 4
EOF
You can browse the list of available image versions in the quay.io/radanalyticsio/openshift-spark repository on Quay.io.
See the Radanalytics Spark operator example configuration for more details on configuring the Spark cluster.
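Once applied, you can check the cluster resources created by the operator, assuming the SparkCluster custom resource definition is installed in the project:
oc get sparkcluster
oc get pods | grep spark-cluster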
🗑️ Delete the deployment
Delete the running JupyterHub application and Spark cluster, including persistent volumes:
./delete_odh.sh kfctl_openshift_dsri.yaml
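You can optionally confirm nothing is left in the project, for example by checking that no JupyterHub or Spark pods and persistent volume claims remain:
oc get pods,pvc | grep -e jupyterhub -e spark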