[petsc-users] errors with hypre with MPI and multiple GPUs on a node
    Yesypenko, Anna
    anna at oden.utexas.edu
    Fri Feb  2 14:12:27 CST 2024

Hi Junchao,
Unfortunately I don't have access to other CUDA machines with multiple GPUs.
I'm pretty stuck, and I think running on a different machine would help isolate the issue.
I'm sharing the Python test script along with the launch script that Victor wrote.
There is a comment in the launch script with the MPI command I used to run the Python script.
I configured hypre without unified memory. In case it's useful, I also attached the configure.log.
If the issue is with petsc/hypre, it may be related to the environment variables described here (e.g. HYPRE_MEMORY_DEVICE):
https://github.com/hypre-space/hypre/wiki/GPUs
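For reference, the test is roughly this kind of solve; the snippet below is only a hypothetical petsc4py sketch (not the attached test_script.py), assuming petsc4py was built with CUDA and hypre enabled and is run with one MPI rank per GPU:

# Hypothetical sketch only -- not the attached test_script.py.
import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

n = 1 << 12                                   # small 1D Laplacian test problem
A = PETSc.Mat().create()
A.setSizes([n, n])
A.setType(PETSc.Mat.Type.AIJCUSPARSE)         # keep the matrix on the GPU
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):
    if i > 0:
        A[i, i - 1] = -1.0
    A[i, i] = 2.0
    if i < n - 1:
        A[i, i + 1] = -1.0
A.assemble()

x, b = A.createVecs()                         # CUDA vectors, inherited from the matrix type
b.set(1.0)

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setType(PETSc.KSP.Type.CG)
ksp.getPC().setType(PETSc.PC.Type.HYPRE)      # BoomerAMG is the default hypre PC
ksp.setFromOptions()
ksp.solve(b, x)
PETSc.Sys.Print(f"converged in {ksp.getIterationNumber()} iterations")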
Thank you for helping me troubleshoot this issue!
Best,
Anna
________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Thursday, February 1, 2024 9:07 PM
To: Yesypenko, Anna <anna at oden.utexas.edu>
Cc: Victor Eijkhout <eijkhout at tacc.utexas.edu>; petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
Hi, Anna,
  Do you have other CUDA machines to try?  If you can share your test, I will run it on Polaris at Argonne to see if it is a petsc/hypre issue.  If not, then it must be a GPU-MPI binding problem on TACC.
  Thanks
--Junchao Zhang
On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna <anna at oden.utexas.edu> wrote:
Hi Victor, Junchao,
Thank you for providing the script, it is very useful!
There are still issues with hypre not binding correctly, and I still get the error message occasionally (though much less often).
I added some additional environment variables to the script that seem to make the behavior more consistent.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK    ## as Victor suggested
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK
The last environment variable is from hypre's documentation on GPUs.
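To double-check what each rank actually sees before hypre starts, a small per-rank check (a hypothetical sketch, assuming mpi4py is available) can be launched with the same mpirun command:

# Hypothetical per-rank sanity check of the GPU binding (assumes mpi4py).
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_rank = os.environ.get("MV2_COMM_WORLD_LOCAL_RANK", "unset")
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "unset")
print(f"rank {comm.rank}/{comm.size}: "
      f"MV2_COMM_WORLD_LOCAL_RANK={local_rank}, "
      f"CUDA_VISIBLE_DEVICES={visible}", flush=True)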
For a small problem size, 4 out of 30 runs fail with a hypre-related error. Do you have any other thoughts or suggestions?
Best,
Anna
________________________________
From: Victor Eijkhout <eijkhout at tacc.utexas.edu>
Sent: Thursday, February 1, 2024 11:26 AM
To: Junchao Zhang <junchao.zhang at gmail.com>; Yesypenko, Anna <anna at oden.utexas.edu>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
Only for mvapich2-gdr:
#!/bin/bash
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin
# Restrict each local rank to one GPU (device index == local rank) ...
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
# ... and pin it to a dedicated CPU range.
case $MV2_COMM_WORLD_LOCAL_RANK in
        0) cpus=0-3 ;;
        1) cpus=64-67 ;;
        2) cpus=72-75 ;;
esac
numactl --physcpubind=$cpus "$@"
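
For comparison, a hypothetical Python variant of the same wrapper (not one of the attachments), which falls back to SLURM_LOCALID when the MVAPICH2 variable is not set, could look like:

#!/usr/bin/env python3
# Hypothetical variant of the launch wrapper above: pick the GPU and CPU range
# from the local rank, then exec the real command under numactl.
import os
import sys

local_rank = int(os.environ.get("MV2_COMM_WORLD_LOCAL_RANK",
                                os.environ.get("SLURM_LOCALID", "0")))
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

cpu_ranges = {0: "0-3", 1: "64-67", 2: "72-75"}    # same mapping as the bash script
cpus = cpu_ranges.get(local_rank, "0-3")

os.execvp("numactl", ["numactl", f"--physcpubind={cpus}"] + sys.argv[1:])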
-------------- next part --------------
Attachments:
  configure.log (text/x-log, 2595931 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240202/4d893bf6/attachment-0001.bin>
  launch (application/octet-stream, 377 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240202/4d893bf6/attachment-0001.obj>
  test_script.py (text/x-python, 1604 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240202/4d893bf6/attachment-0001.py>