[petsc-users] errors with hypre with MPI and multiple GPUs on a node
Yesypenko, Anna
anna at oden.utexas.edu
Fri Feb 2 14:12:27 CST 2024
Hi Junchao,
Unfortunately, I don't have access to other CUDA machines with multiple GPUs.
I'm pretty stuck, and I think running on a different machine would help isolate the issue.
I'm sharing the Python script and the launch script that Victor wrote.
The launch script contains a comment with the MPI command I was using to run the Python script.
I configured hypre without unified memory. In case it's useful, I also attached the configure.log.
If the issue is with petsc/hypre, it may be related to the environment variables described here (e.g., HYPRE_MEMORY_DEVICE):
https://github.com/hypre-space/hypre/wiki/GPUs
Thank you for helping me troubleshoot this issue!
Best,
Anna
________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Thursday, February 1, 2024 9:07 PM
To: Yesypenko, Anna <anna at oden.utexas.edu>
Cc: Victor Eijkhout <eijkhout at tacc.utexas.edu>; petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
Hi, Anna,
Do you have other CUDA machines to try? If you can share your test, I will run it on Polaris at Argonne to see whether it is a petsc/hypre issue. If it is not, then it must be a GPU-MPI binding problem on TACC.
Thanks
--Junchao Zhang
On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna <anna at oden.utexas.edu> wrote:
Hi Victor, Junchao,
Thank you for providing the script, it is very useful!
There are still issues with hypre not binding to the GPUs correctly, and I still see the error message occasionally (though much less often).
I added some additional environment variables to the launch script that seem to make the behavior more consistent.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK ## as Victor suggested
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK
The last environment variable is from hypre's documentation on GPUs.
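Put together, the launch wrapper with these additions looks roughly like this (a sketch, assuming MVAPICH2 sets MV2_COMM_WORLD_LOCAL_RANK for each rank, as in Victor's script below):

#!/bin/bash
# Sketch of a combined per-rank wrapper: select one GPU per local rank and
# point hypre at it before the application starts, then run the command.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK
exec "$@"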
In 30 runs for a small problem size, 4 fail with a hypre-related error. Do you have any other thoughts or suggestions?
Best,
Anna
________________________________
From: Victor Eijkhout <eijkhout at tacc.utexas.edu>
Sent: Thursday, February 1, 2024 11:26 AM
To: Junchao Zhang <junchao.zhang at gmail.com>; Yesypenko, Anna <anna at oden.utexas.edu>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
Only for mvapich2-gdr:
#!/bin/bash
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin
# Give each local rank its own GPU ...
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
# ... and pin the rank to the CPU cores nearest that GPU.
case $MV2_COMM_WORLD_LOCAL_RANK in
  0) cpus=0-3 ;;
  1) cpus=64-67 ;;
  2) cpus=72-75 ;;
esac
numactl --physcpubind=$cpus "$@"
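The same idea carries over to other MPI stacks by swapping the local-rank variable (a sketch; OMPI_COMM_WORLD_LOCAL_RANK and SLURM_LOCALID are the OpenMPI and Slurm equivalents, and the core ranges here are illustrative):

#!/bin/bash
# Sketch of a portable wrapper: pick up the per-node local rank from
# whichever launcher set it, then bind one GPU and a CPU range to the rank.
local_rank=${MV2_COMM_WORLD_LOCAL_RANK:-${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}}
export CUDA_VISIBLE_DEVICES=$local_rank
case $local_rank in
  0) cpus=0-3 ;;
  1) cpus=64-67 ;;
  2) cpus=72-75 ;;
esac
numactl --physcpubind=$cpus "$@"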
Attachments:
- configure.log (text/x-log, 2595931 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240202/4d893bf6/attachment-0001.bin>
- launch (application/octet-stream, 377 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240202/4d893bf6/attachment-0001.obj>
- test_script.py (text/x-python, 1604 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240202/4d893bf6/attachment-0001.py>