
Run MPI jobs

We deployed the MPI Operator from Kubeflow to run MPI jobs on the DSRI.

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes. Please check out this blog post for an introduction to MPI Operator and its industry adoption.
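To give an intuition for what "allreduce-style" means: every worker contributes a value (typically a gradient) and every worker ends up with the combined result. The toy sketch below simulates that pattern with in-process "workers" summing around a ring; it is purely illustrative and not part of the MPI Operator example (real ring allreduce also chunks the data to overlap communication):

```python
# Illustrative sketch only: a toy sum-allreduce over in-process "workers",
# showing the communication pattern that mpirun distributes across real pods.

def ring_allreduce(worker_values):
    """Sum-allreduce: every worker ends up with the sum of all values."""
    n = len(worker_values)
    values = list(worker_values)
    # Each step, every worker receives its ring neighbour's partial sum and
    # adds its own value; after n - 1 steps every slot holds the full sum.
    for _ in range(n - 1):
        values = [values[(i - 1) % n] + worker_values[i] for i in range(n)]
    return values

print(ring_allreduce([1.0, 2.0, 3.0]))  # every worker gets 6.0
```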

Run MPI jobs on CPU

Check out the MPI Operator repository for a complete example of an MPI job: the Python script, the Dockerfile, and the job deployment YAML.

  1. Clone the repository, and go to the example folder:
git clone https://github.com/kubeflow/mpi-operator.git
cd mpi-operator/examples/horovod
  2. Open the tensorflow-mnist.yaml file, and fix the apiVersion on the first line:
# From
apiVersion: kubeflow.org/v1
# To
apiVersion: kubeflow.org/v1alpha2

You will also need to specify that these containers can run as the root user, by adding serviceAccountName: anyuid between spec: and containers: in both the launcher and the worker templates:

    template:
      spec:
        serviceAccountName: anyuid
        containers:
        - image: docker.io/kubeflow/mpi-horovod-mnist

Your tensorflow-mnist.yaml file should look like this:

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
  slotsPerWorker: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          serviceAccountName: anyuid
          containers:
          - image: docker.io/kubeflow/mpi-horovod-mnist
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /examples/tensorflow_mnist.py
            resources:
              limits:
                cpu: 1
                memory: 2Gi
    Worker:
      replicas: 2
      template:
        spec:
          serviceAccountName: anyuid
          containers:
          - image: docker.io/kubeflow/mpi-horovod-mnist
            name: mpi-worker
            resources:
              limits:
                cpu: 2
                memory: 4Gi

  3. Once this has been set, create the job in your current project on the DSRI (switch projects with oc project my-project):
oc create -f tensorflow-mnist.yaml

You should see the 2 workers and the launcher running on your project's Topology page in the DSRI web UI. You can then easily check the logs of the launcher and the workers.
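If you prefer the command line over the web UI, you can also inspect the job with oc. These commands require access to the cluster, and the job and pod names below are assumptions derived from the MPIJob name (the MPI Operator typically names resources after it):

```shell
# List the pods created for the MPIJob: 2 workers plus a launcher
# (names assumed to derive from the MPIJob name "tensorflow-mnist")
oc get pods | grep tensorflow-mnist

# Stream the launcher logs to follow the training output
# (the launcher is assumed to run as a Job named tensorflow-mnist-launcher)
oc logs -f job/tensorflow-mnist-launcher

# Check a worker's logs (worker pods assumed to be numbered from 0)
oc logs tensorflow-mnist-worker-0
```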

To run your own MPI job on the DSRI, you can take a look at, and edit, the different files provided by the MPI Operator example:

🐍 tensorflow_mnist.py: the python script with the actual job to run

🐳 Dockerfile.cpu: the Dockerfile to define the image of the containers in which your job will run (install dependencies)

⛵️ tensorflow-mnist.yaml: the YAML file to define the MPI deployment on Kubernetes (number and resource limits of workers, mpirun command, etc.)

Visit the Kubeflow documentation on creating an MPI job for more details.

Contact us

Feel free to contact us on the DSRI Slack #helpdesk channel to discuss the use of MPI on the DSRI.