[petsc-users] errors with hypre with MPI and multiple GPUs on a node

Yesypenko, Anna anna at oden.utexas.edu
Fri Feb 2 14:12:27 CST 2024


Hi Junchao,

Unfortunately, I don't have access to other CUDA machines with multiple GPUs.
I'm pretty stuck, and I think running on a different machine would help isolate the issue.

I'm sharing the Python script and the launch script that Victor wrote.
There is a comment in the launch script with the MPI command I was using to run the Python script.
I configured hypre without unified memory. In case it's useful, I've also attached the configure.log.
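
For reference, the test does roughly the following (a sketch only, not the attached test_script.py; the problem, sizes, and options here are placeholders): solve a simple 1D Laplacian with CG preconditioned by hypre BoomerAMG, with the matrix type picked up from the command line (e.g. -mat_type aijcusparse).

import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

# Sketch only: assemble a 1D Laplacian and solve with CG + hypre BoomerAMG.
n = 1000                                  # placeholder problem size
A = PETSc.Mat().create()
A.setSizes([n, n])
A.setFromOptions()                        # picks up -mat_type aijcusparse
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()

x, b = A.createVecs()                     # vectors match the matrix type
b.set(1.0)

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setType(PETSc.KSP.Type.CG)
ksp.getPC().setType(PETSc.PC.Type.HYPRE)  # BoomerAMG is the hypre default
ksp.setFromOptions()
ksp.solve(b, x)
PETSc.Sys.Print("iterations:", ksp.getIterationNumber())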

If the issue is with PETSc/hypre, it may be related to the environment variables described here (e.g. HYPRE_MEMORY_DEVICE):
https://github.com/hypre-space/hypre/wiki/GPUs

Thank you for helping me troubleshoot this issue!
Best,
Anna






________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Thursday, February 1, 2024 9:07 PM
To: Yesypenko, Anna <anna at oden.utexas.edu>
Cc: Victor Eijkhout <eijkhout at tacc.utexas.edu>; petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node

Hi, Anna,
  Do you have other CUDA machines to try? If you can share your test, I will run it on Polaris at Argonne to see whether it is a PETSc/hypre issue. If it is not, then it is likely a GPU-MPI binding problem on the TACC machine.

  Thanks
--Junchao Zhang


On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna <anna at oden.utexas.edu> wrote:
Hi Victor, Junchao,

Thank you for providing the script; it is very useful!
There are still issues with hypre not binding to the GPUs correctly, and I'm still seeing the error message occasionally (though much less often).
I added some additional environment variables to the script that seem to make the behavior more consistent.

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK    ## as Victor suggested
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

The last environment variable is from hypre's documentation on GPUs.
Out of 30 runs for a small problem size, 4 still fail with a hypre-related error. Do you have any other thoughts or suggestions?
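
If it helps to narrow down the binding question, a quick per-rank check (a sketch, assuming mpi4py is available; MV2_COMM_WORLD_LOCAL_RANK is the same mvapich2 variable used above) can print what each process sees before the solve:

import os
from mpi4py import MPI

# Sketch of a per-rank binding check: report which GPU(s) each rank can see.
rank = MPI.COMM_WORLD.Get_rank()
local_rank = os.environ.get("MV2_COMM_WORLD_LOCAL_RANK", "unset")
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "unset")
print("rank %s: local rank %s, CUDA_VISIBLE_DEVICES=%s" % (rank, local_rank, visible), flush=True)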

Best,
Anna

________________________________
From: Victor Eijkhout <eijkhout at tacc.utexas.edu>
Sent: Thursday, February 1, 2024 11:26 AM
To: Junchao Zhang <junchao.zhang at gmail.com>; Yesypenko, Anna <anna at oden.utexas.edu>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node


Only for mvapich2-gdr:

#!/bin/bash
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin

# Give each local MPI rank its own GPU.
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK

# Pin each local rank to its own small block of cores.
case $MV2_COMM_WORLD_LOCAL_RANK in
    0) cpus=0-3 ;;
    1) cpus=64-67 ;;
    2) cpus=72-75 ;;
esac

# Run the target program with that CPU binding.
numactl --physcpubind=$cpus "$@"
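
For the Python test the invocation follows the same pattern, along the lines of (a sketch; the exact command is in the comment in the attached launch script):

mpirun -n 3 MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch python3 test_script.py

(3 ranks are assumed here, one per GPU, to match the three cases above.)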


Attachments:
- configure.log (text/x-log, 2595931 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240202/4d893bf6/attachment-0001.bin>
- launch (application/octet-stream, 377 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240202/4d893bf6/attachment-0001.obj>
- test_script.py (text/x-python, 1604 bytes): <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20240202/4d893bf6/attachment-0001.py>

