[petsc-users] errors with hypre with MPI and multiple GPUs on a node

Yesypenko, Anna anna at oden.utexas.edu
Thu Feb 1 17:31:38 CST 2024


Hi Victor, Junchao,

Thank you for providing the script; it is very useful!
There are still issues with hypre not binding correctly, and I still see the error message occasionally (though much less often than before).
I added some additional environment variables to the script that seem to make the behavior more consistent.

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK    ## as Victor suggested
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

The last environment variable is from hypre's documentation on GPUs.
Out of 30 runs for a small problem size, 4 still fail with a hypre-related error. Do you have any other thoughts or suggestions?
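
For reference, a minimal per-rank check along these lines can confirm the binding before the real run (just a sketch, untested; it assumes mvapich2 exports MV2_COMM_WORLD_LOCAL_RANK on every rank, that its value maps one GPU per rank, and that nvidia-smi is on the path):

#!/bin/bash
# check-binding: report which physical GPU each local rank is restricted to,
# then exec the real program with its arguments.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
gpu=$(nvidia-smi -i $CUDA_VISIBLE_DEVICES --query-gpu=index,uuid --format=csv,noheader)
echo "$(hostname) local rank $MV2_COMM_WORLD_LOCAL_RANK -> GPU $gpu"
exec "$@"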

Best,
Anna

________________________________
From: Victor Eijkhout <eijkhout at tacc.utexas.edu>
Sent: Thursday, February 1, 2024 11:26 AM
To: Junchao Zhang <junchao.zhang at gmail.com>; Yesypenko, Anna <anna at oden.utexas.edu>
Cc: petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node


Only for mvapich2-gdr:

#!/bin/bash

# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin

export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK

case $MV2_COMM_WORLD_LOCAL_RANK in
        0) cpus=0-3 ;;
        1) cpus=64-67 ;;
        2) cpus=72-75 ;;
esac

numactl --physcpubind=$cpus "$@"
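
The exports from earlier in the thread can be folded into the same wrapper; a combined sketch (untested; the fallback CPU range for any additional local ranks is an assumption and should be adjusted to the node's core layout) would be:

#!/bin/bash
# Combined per-rank launcher: GPU visibility, hypre device id, and CPU binding.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

case $MV2_COMM_WORLD_LOCAL_RANK in
        0) cpus=0-3 ;;
        1) cpus=64-67 ;;
        2) cpus=72-75 ;;
        *) cpus=0-3 ;;   # fallback (assumption; adjust to the node's NUMA layout)
esac

exec numactl --physcpubind=$cpus "$@"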

