[petsc-users] errors with hypre with MPI and multiple GPUs on a node

Yesypenko, Anna anna at oden.utexas.edu
Sun Feb 4 19:55:52 CST 2024


Hi Junchao, Victor,

I fixed the issue! It turned out to be the CPU bindings: a Python process effectively runs on a single core,
so I had to modify the MPI launch script so that each Python instance is bound to exactly one physical core.
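
For anyone who hits the same thing, here is a rough sketch of the kind of one-core-per-rank wrapper I mean (the core numbers are placeholders; the structure follows Victor's mvapich2 launch script quoted at the bottom of this thread):

#!/bin/bash
# Sketch only: give each local MPI rank its own GPU and a single physical core.
# The core numbers are placeholders; choose them to match the node's topology.
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
case $MV2_COMM_WORLD_LOCAL_RANK in
        0) core=0  ;;
        1) core=64 ;;
        2) core=72 ;;
esac
numactl --physcpubind=$core "$@"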

Thank you both very much for your patience and help!

Best,
Anna
________________________________
From: Yesypenko, Anna <anna at oden.utexas.edu>
Sent: Friday, February 2, 2024 2:12 PM
To: Junchao Zhang <junchao.zhang at gmail.com>
Cc: Victor Eijkhout <eijkhout at tacc.utexas.edu>; petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node

Hi Junchao,

Unfortunately, I don't have access to other CUDA machines with multiple GPUs.
I'm pretty stuck, and I think running on a different machine would help isolate the issue.

I'm sharing the Python script and the launch script that Victor wrote.
A comment in the launch script shows the MPI command I was using to run the Python script.
I configured hypre without unified memory. In case it's useful, I also attached the configure.log.

If the issue is with petsc/hypre, it may be in the environment variables described here (e.g. HYPRE_MEMORY_DEVICE):
https://github.com/hypre-space/hypre/wiki/GPUs

Thank you for helping me troubleshoot this issue!
Best,
Anna

________________________________
From: Junchao Zhang <junchao.zhang at gmail.com>
Sent: Thursday, February 1, 2024 9:07 PM
To: Yesypenko, Anna <anna at oden.utexas.edu>
Cc: Victor Eijkhout <eijkhout at tacc.utexas.edu>; petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node

Hi, Anna,
  Do you have other CUDA machines to try?  If you can share your test, then I will run on Polaris at Argonne to see if it is a petsc/hypre issue.  If not, then it must be a GPU-MPI binding problem on TACC.
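  A quick way to sanity-check the binding (just a sketch, assuming mvapich2 sets MV2_COMM_WORLD_LOCAL_RANK as in Victor's script) is a wrapper that prints what each rank actually sees before running the real program:

#!/bin/bash
# Sketch: report each local rank's visible GPU and CPU affinity, then exec the real program.
echo "local rank $MV2_COMM_WORLD_LOCAL_RANK: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES $(numactl --show | grep physcpubind)"
exec "$@"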

  Thanks
--Junchao Zhang


On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna <anna at oden.utexas.edu<mailto:anna at oden.utexas.edu>> wrote:
Hi Victor, Junchao,

Thank you for providing the script; it is very useful!
There are still binding issues with hypre: I still get the error message occasionally, though much less often.
I added some additional environment variables to the script that seem to make the behavior more consistent.

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK    ## as Victor suggested
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

The last environment variable is from hypre's documentation on GPUs.
Out of 30 runs at a small problem size, 4 still fail with a hypre-related error. Do you have any other thoughts or suggestions?

Best,
Anna

________________________________
From: Victor Eijkhout <eijkhout at tacc.utexas.edu<mailto:eijkhout at tacc.utexas.edu>>
Sent: Thursday, February 1, 2024 11:26 AM
To: Junchao Zhang <junchao.zhang at gmail.com<mailto:junchao.zhang at gmail.com>>; Yesypenko, Anna <anna at oden.utexas.edu<mailto:anna at oden.utexas.edu>>
Cc: petsc-users at mcs.anl.gov<mailto:petsc-users at mcs.anl.gov> <petsc-users at mcs.anl.gov<mailto:petsc-users at mcs.anl.gov>>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node


Only for mvapich2-gdr:



#!/bin/bash

# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin

# Give each local rank its own GPU.
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK

# Bind each local rank to its own block of physical cores.
case $MV2_COMM_WORLD_LOCAL_RANK in
        0) cpus=0-3 ;;
        1) cpus=64-67 ;;
        2) cpus=72-75 ;;
esac

numactl --physcpubind=$cpus "$@"

