[petsc-users] errors with hypre with MPI and multiple GPUs on a node
Yesypenko, Anna
anna at oden.utexas.edu
Thu Feb 1 17:31:38 CST 2024
Hi Victor, Junchao,
Thank you for providing the script; it is very useful!
There are still some issues with hypre not binding to the GPUs correctly, and I still see the error occasionally (though much less often than before).
I added some additional environment variables to the script that seem to make the behavior more consistent.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK ## as Victor suggested
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK
The last environment variable is from hypre's documentation on GPUs.
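For completeness, the wrapper I am running now is roughly the following: it is just Victor's script below with the extra exports added. The CPU ranges are the ones he suggested and may need adjusting for a different node layout.

#!/bin/bash
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin
# One GPU per local rank, plus the hypre device hint.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK
# Pin each rank to a disjoint block of cores near its GPU.
case $MV2_COMM_WORLD_LOCAL_RANK in
  0) cpus=0-3 ;;
  1) cpus=64-67 ;;
  2) cpus=72-75 ;;
esac
numactl --physcpubind=$cpus "$@"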
In 30 runs for a small problem size, 4 fail with a hypre-related error. Do you have any other thoughts or suggestions?
Best,
Anna
________________________________
From: Victor Eijkhout <eijkhout at tacc.utexas.edu>
Sent: Thursday, February 1, 2024 11:26 AM
To: Junchao Zhang <junchao.zhang at gmail.com>; Yesypenko, Anna <anna at oden.utexas.edu>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
Only for mvapich2-gdr:
#!/bin/bash
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin
# Give each local MPI rank its own GPU ...
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
# ... and pin it to a disjoint block of cores near that GPU.
case $MV2_COMM_WORLD_LOCAL_RANK in
  0) cpus=0-3 ;;
  1) cpus=64-67 ;;
  2) cpus=72-75 ;;
esac
# Launch the actual binary under the chosen CPU binding.
numactl --physcpubind=$cpus "$@"
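A quick sanity check (hypothetical helper, not part of the script above) is to run a trivial payload through the wrapper so each rank prints the device and CPU set it actually received:

#!/bin/bash
# check_binding.sh (hypothetical): print what each rank sees.
# Run it through the wrapper in place of the real binary, e.g.
#   mpirun -n 3 MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./check_binding.sh
echo "local rank $MV2_COMM_WORLD_LOCAL_RANK:" \
     "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES," \
     "cpus=$(taskset -cp $$ | cut -d: -f2)"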