MPI Jobs
MPI (Message Passing Interface) is a standard for running parallel computing jobs across multiple processes or machines. It is widely used in scientific computing, high performance computing (HPC), and distributed machine learning training.
The MPI Operator from Kubeflow makes it easy to run MPI jobs on DSRI. It manages the launcher (the main process that coordinates the work) and the workers (the processes that do the actual computation) as Kubernetes pods running in parallel.
Before you can deploy MPI jobs, ask the RCS team to enable the MPI Operator in your project. Once it is enabled, you can submit MPI jobs to the cluster.
Test your setup
Before running your own code, verify that MPI works in your project by running the example pi calculation job. This job uses 2 workers to calculate the value of pi in parallel.
1. Clone the MPI operator repository
git clone https://github.com/kubeflow/mpi-operator.git
cd mpi-operator/examples/v2beta1/pi
2. Submit the job
Make sure you are in the correct project:
oc project my-project
Then create the job:
oc create -f pi.yaml
3. Monitor the job
Check the job status:
oc get mpijob pi
Check the pods (you should see 1 launcher and 2 workers):
oc get pods
4. Check the results
Once the launcher pod shows Completed, check the output:
oc logs pi-launcher-<id>
You should see:
Workers: 2
Rank 0 on host pi-launcher
Rank 1 on host pi-launcher
pi is approximately 3.1410376000000002
This output means the job completed successfully. The workers terminate automatically after the launcher finishes.
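Steps 3 and 4 can also be scripted so that you do not have to copy the generated launcher pod name by hand. The sketch below is a minimal example, not an official DSRI tool. It assumes the MPIJob reports a Succeeded condition and that the launcher pod carries the labels training.kubeflow.org/job-name and training.kubeflow.org/job-role=launcher, as the upstream MPI Operator sets them. If the selector matches nothing, check the actual labels with oc get pods --show-labels. The function name mpijob_wait_and_logs and the default timeout are illustrative.

```shell
# Wait for an MPIJob to succeed, then print the launcher logs.
# Assumption: the pod labels below match those set by the MPI Operator on DSRI.
mpijob_wait_and_logs() {
  job="$1"
  timeout="${2:-600s}"
  # Block until the MPIJob reports the Succeeded condition (or the timeout expires).
  oc wait --for=condition=Succeeded "mpijob/${job}" --timeout="${timeout}" || return 1
  # Look up the launcher pod by label instead of copying its generated name.
  pod="$(oc get pods \
    -l "training.kubeflow.org/job-name=${job},training.kubeflow.org/job-role=launcher" \
    -o name | head -n 1)"
  [ -n "${pod}" ] || { echo "no launcher pod found for ${job}" >&2; return 1; }
  oc logs "${pod}"
}
```

For the pi example, this could be called as mpijob_wait_and_logs pi 300s.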
5. Clean up
oc delete mpijob pi
Run your own MPI job
The pi example above is just a sanity check. To run your own code on DSRI as an MPI job you need to:
- Package your code and MPI dependencies in a Docker image — see the DSRI guide for building Docker images
- Use the pi example YAML as a starting point and replace:
  - image - your Docker image containing your code
  - command and args under the Launcher - your executable or script
  - replicas under Worker - the number of workers you need
  - resources - CPU and memory based on your workload
Always keep serviceAccountName: anyuid and runAsUser: 1000 in both the Launcher and Worker specs — these are required for MPI jobs to run correctly on DSRI.
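As a rough illustration of these replacements, the following sketch writes a manifest modelled on the upstream pi example. Treat it as a starting point under assumptions, not a verified DSRI configuration: the job name my-mpi-job, the image my-registry/my-mpi-image:latest, and the program path /home/mpiuser/my-program are hypothetical placeholders. It also assumes your image is built like the upstream example image (an mpiuser home directory with an sshd configuration) and that runAsUser sits in the container securityContext as it does upstream.

```shell
# Write a starting-point manifest modelled on the upstream pi example.
# Hypothetical placeholders: my-mpi-job, my-registry/my-mpi-image:latest,
# /home/mpiuser/my-program. Replace them with your own values.
cat > my-mpi-job.yaml <<'EOF'
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: my-mpi-job
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          serviceAccountName: anyuid        # required on DSRI
          containers:
          - name: mpi-launcher
            image: my-registry/my-mpi-image:latest
            securityContext:
              runAsUser: 1000               # required on DSRI
            command: ["mpirun"]
            args: ["-n", "2", "/home/mpiuser/my-program"]
            resources:
              limits:
                cpu: 1
                memory: 1Gi
    Worker:
      replicas: 2                           # number of workers
      template:
        spec:
          serviceAccountName: anyuid        # required on DSRI
          containers:
          - name: mpi-worker
            image: my-registry/my-mpi-image:latest
            securityContext:
              runAsUser: 1000               # required on DSRI
            command: ["/usr/sbin/sshd"]
            args: ["-De", "-f", "/home/mpiuser/.sshd_config"]
            resources:
              limits:
                cpu: 1
                memory: 1Gi
EOF
```

The resulting file can then be submitted in the same way as the pi example, with oc create -f my-mpi-job.yaml. Keep the process count passed to mpirun consistent with the number of worker replicas.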
Monitor and debug jobs
Check job status:
oc get mpijob
oc describe mpijob my-mpi-job
Check launcher logs (where your output will appear):
oc logs <launcher-pod-name>
Check worker logs:
oc logs <worker-pod-name>
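When a job has many workers, it can be convenient to gather the status, recent events, and logs in one pass. The helper below is a small sketch of that idea. It assumes all pods of a job carry the training.kubeflow.org/job-name label, as the upstream MPI Operator sets it. The function name mpijob_debug and the line limits are illustrative choices.

```shell
# Print status, recent events and logs for every pod of one MPIJob.
# Assumption: pods carry the training.kubeflow.org/job-name label.
mpijob_debug() {
  job="$1"
  oc describe mpijob "${job}"
  # Recent events often explain pods that are stuck before they start.
  oc get events --sort-by=.lastTimestamp | tail -n 20
  # Show the last lines of the logs of each launcher and worker pod.
  for pod in $(oc get pods -l "training.kubeflow.org/job-name=${job}" -o name); do
    echo "=== ${pod} ==="
    oc logs "${pod}" --tail=50
  done
}
```

For example, mpijob_debug my-mpi-job prints the job description first, then the events of the current project, then the tail of each pod's log.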
See also
Feel free to contact us using the Topdesk Form to discuss the use of MPI on the DSRI or if you need help adapting your code to run as an MPI job.