[petsc-users] errors with hypre with MPI and multiple GPUs on a node
Yesypenko, Anna
anna at oden.utexas.edu
Sun Feb 4 19:55:52 CST 2024
Hi Junchao, Victor,
I fixed the issue! It was with the CPU bindings: Python is effectively limited to running on a single core.
I had to modify the MPI launch script to make sure that each Python instance is bound to only one physical core.
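For reference, here is a minimal sketch of the kind of change I mean, assuming the mvapich2-gdr launcher and the core numbering from Victor's script quoted below (the exact core numbers depend on the node layout):

#!/bin/bash
# Sketch of a per-rank launcher: one GPU and one physical core per Python instance.
# Usage (as before): mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch python3 <your_script.py>
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
case $MV2_COMM_WORLD_LOCAL_RANK in
  0) cpu=0  ;;
  1) cpu=64 ;;
  2) cpu=72 ;;
esac
# Bind this rank's Python instance to a single physical core.
numactl --physcpubind=$cpu "$@"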
Thank you both very much for your patience and help!
Best,
Anna
________________________________
From: Yesypenko, Anna <anna at oden.utexas.edu>
Sent: Friday, February 2, 2024 2:12 PM
To: Junchao Zhang <junchao.zhang at gmail.com>
Cc: Victor Eijkhout <eijkhout at tacc.utexas.edu>; petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
Hi Junchao,
Unfortunately, I don't have access to other CUDA machines with multiple GPUs.
I'm pretty stuck, and I think running on a different machine would help isolate the issue.
I'm sharing my Python script along with the launch script that Victor wrote.
There is a comment in the launch script with the MPI command I was using to run the Python script.
I configured hypre without unified memory. In case it's useful, I also attached the configure.log.
If the issue is with petsc/hypre, it may be related to the environment variables described here (e.g. HYPRE_MEMORY_DEVICE):
https://github.com/hypre-space/hypre/wiki/GPUs
Thank you for helping me troubleshoot this issue!
Best,
Anna
________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Thursday, February 1, 2024 9:07 PM
To: Yesypenko, Anna <anna at oden.utexas.edu>
Cc: Victor Eijkhout <eijkhout at tacc.utexas.edu>; petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
Hi, Anna,
Do you have other CUDA machines to try? If you can share your test, I will run it on Polaris at Argonne to see whether it is a petsc/hypre issue. If not, then it must be a GPU-MPI binding problem at TACC.
Thanks
--Junchao Zhang
On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna <anna at oden.utexas.edu> wrote:
Hi Victor, Junchao,
Thank you for providing the script, it is very useful!
There are still issues with hypre not binding correctly, and I'm still getting the error message occasionally (though much less often).
I added some additional environment variables to the script that seem to make the behavior more consistent:
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK ## as Victor suggested
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK
The last environment variable is from hypre's documentation on GPUs.
Out of 30 runs for a small problem size, 4 still fail with a hypre-related error. Do you have any other thoughts or suggestions?
Best,
Anna
________________________________
From: Victor Eijkhout <eijkhout at tacc.utexas.edu>
Sent: Thursday, February 1, 2024 11:26 AM
To: Junchao Zhang <junchao.zhang at gmail.com>; Yesypenko, Anna <anna at oden.utexas.edu>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
Only for mvapich2-gdr:
#!/bin/bash
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin
# Give each local rank its own GPU ...
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
# ... and bind it to its own block of cores.
case $MV2_COMM_WORLD_LOCAL_RANK in
  0) cpus=0-3   ;;
  1) cpus=64-67 ;;
  2) cpus=72-75 ;;
esac
numactl --physcpubind=$cpus "$@"
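To verify what each rank actually ends up with, something like the following should work (a sketch; it assumes taskset is available and uses the same invocation as in the usage line above):

# Print the GPU and the CPU affinity that each local rank receives.
mpirun -n 3 MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch \
  bash -c 'echo "rank $MV2_COMM_WORLD_LOCAL_RANK: GPUs=$CUDA_VISIBLE_DEVICES, $(taskset -cp $$)"'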