[petsc-users] Configure error while building PETSc with CUDA/MVAPICH2-GDR

Junchao Zhang junchao.zhang at gmail.com
Fri Apr 19 12:48:50 CDT 2024


Thanks for the trick. We can prepare an example script for Lonestar6 and
mention it in the docs.
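
Something along these lines, as a first sketch (the module names, the
library path, and the executable are placeholders that would need to be
checked against what Lonestar6 actually provides):

  #!/bin/bash
  # Load CUDA and the GPU-aware MPI; the CUDA version has to match the one
  # mvapich2-gdr was built against (see the version mismatch further down
  # in this thread).
  module load cuda/11.4 mvapich2-gdr

  # Work around the GPU-aware-MPI detection problem by preloading
  # MVAPICH2-GDR's libmpi, as suggested by the MVAPICH developers.
  export LD_PRELOAD=/path/to/mvapich2-gdr/lib64/libmpi.so:$LD_PRELOAD

  # Enable CUDA support in MVAPICH2 at run time.
  export MV2_USE_CUDA=1

  # Launch the (placeholder) PETSc application with TACC's MPI launcher.
  ibrun ./my_petsc_app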

--Junchao Zhang


On Fri, Apr 19, 2024 at 11:55 AM Sreeram R Venkat <srvenkat at utexas.edu>
wrote:

> I talked to the MVAPICH people, and they told me to try adding
> /path/to/mvapich2-gdr/lib64/libmpi.so to LD_PRELOAD (apparently, they've
> had this issue before). This seemed to do the trick; I can build everything
> with MVAPICH2-GDR and run with it now. Not sure if this is something you
> want to add to the docs.
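>
> (One quick sanity check, with my_app standing in for the actual binary:
> running
>
>     ldd ./my_app | grep libmpi
>
> in the same environment should, with LD_PRELOAD set, resolve to the
> mvapich2-gdr lib64 directory rather than the MPI the code was built with.)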
>
> Thanks,
> Sreeram
>
> On Wed, Apr 17, 2024 at 9:17 AM Junchao Zhang <junchao.zhang at gmail.com>
> wrote:
>
>> I looked at it before and checked again, and still see
>> https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi
>>  > Using both MPI and NCCL to perform transfers between the same sets of
>> CUDA devices concurrently is therefore not guaranteed to be safe.
>>
>> I was scared off by it: it means we would have to replace all MPI device
>> communication (what if it comes from a third-party library?) with NCCL.
>>
>> --Junchao Zhang
>>
>>
>> On Wed, Apr 17, 2024 at 8:27 AM Sreeram R Venkat <srvenkat at utexas.edu>
>> wrote:
>>
>>> Yes, I saw this paper
>>> https://www.sciencedirect.com/science/article/abs/pii/S016781912100079X
>>> that mentioned it, and in Barry's talk at SIAM PP this year I heard about
>>> the need for stream-aware MPI, so I was wondering whether NCCL would be
>>> used in PETSc for GPU-GPU communication.
>>>
>>> On Wed, Apr 17, 2024, 7:58 AM Junchao Zhang <junchao.zhang at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Apr 17, 2024 at 7:51 AM Sreeram R Venkat <srvenkat at utexas.edu>
>>>> wrote:
>>>>
>>>>> Do you know if there are plans for NCCL support in PETSc?
>>>>>
>>>> What is your use case?  Do you mean using NCCL for the MPI communication?
>>>>
>>>>
>>>>>
>>>>> On Tue, Apr 16, 2024, 10:41 PM Junchao Zhang <junchao.zhang at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Glad to hear you found a way.  Did you use Frontera at TACC?  If so, I
>>>>>> could give it a try.
>>>>>>
>>>>>> --Junchao Zhang
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 16, 2024 at 8:35 PM Sreeram R Venkat <srvenkat at utexas.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> I finally figured out a way to make it work. I had to build PETSc and
>>>>>>> my application with the (non-GPU-aware) Intel MPI and then, before
>>>>>>> running, switch to MVAPICH2-GDR. I'm not sure why that works, but it's
>>>>>>> the only way I've found to compile and run successfully without any
>>>>>>> errors about not having a GPU-aware MPI.
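>>>>>>>
>>>>>>> Roughly, the sequence looks like this (a sketch only -- the module and
>>>>>>> arch names are placeholders for the actual TACC ones):
>>>>>>>
>>>>>>>     module load intel impi cuda
>>>>>>>     ./configure PETSC_ARCH=arch-impi-cuda --with-cuda=true ...
>>>>>>>     make all                         # build PETSc, then the application
>>>>>>>     module swap impi mvapich2-gdr    # switch MPIs before running
>>>>>>>     export MV2_USE_CUDA=1
>>>>>>>     ibrun ./my_app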
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 8, 2023 at 5:30 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>>>>
>>>>>>>> You may need to set some environment variables. This can be
>>>>>>>> system-specific, so you might want to look at the docs or ask TACC how
>>>>>>>> to run with GPU-aware MPI.
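>>>>>>>>
>>>>>>>> For example (just a sketch; the MVAPICH2 user guide and the TACC docs
>>>>>>>> have the authoritative list):
>>>>>>>>
>>>>>>>>     module show mvapich2-gdr   # see what the module itself sets
>>>>>>>>     export MV2_USE_CUDA=1      # enable CUDA support at run time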
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On Fri, Dec 8, 2023 at 5:17 PM Sreeram R Venkat <
>>>>>>>> srvenkat at utexas.edu> wrote:
>>>>>>>>
>>>>>>>>> Actually, when I compile my program with this build of PETSc and run
>>>>>>>>> it, I still get the error:
>>>>>>>>>
>>>>>>>>> PETSC ERROR: PETSc is configured with GPU support, but your MPI is
>>>>>>>>> not GPU-aware. For better performance, please use a GPU-aware MPI.
>>>>>>>>>
>>>>>>>>> I have the mvapich2-gdr module loaded and MV2_USE_CUDA=1.
>>>>>>>>>
>>>>>>>>> Is there anything else I need to do?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Sreeram
>>>>>>>>>
>>>>>>>>> On Fri, Dec 8, 2023 at 3:29 PM Sreeram R Venkat <
>>>>>>>>> srvenkat at utexas.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you, changing to CUDA 11.4 fixed the issue. The
>>>>>>>>>> mvapich2-gdr module didn't require CUDA 11.4 as a dependency, so I was
>>>>>>>>>> using 12.0.
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 8, 2023 at 1:15 PM Satish Balay <balay at mcs.anl.gov>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Executing: mpicc -show
>>>>>>>>>>> stdout: icc -I/opt/apps/cuda/11.4/include
>>>>>>>>>>> -I/opt/apps/cuda/11.4/include -lcuda -L/opt/apps/cuda/11.4/lib64/stubs
>>>>>>>>>>> -L/opt/apps/cuda/11.4/lib64 -lcudart -lrt
>>>>>>>>>>> -Wl,-rpath,/opt/apps/cuda/11.4/lib64 -Wl,-rpath,XORIGIN/placeholder
>>>>>>>>>>> -Wl,--build-id -L/opt/apps/cuda/11.4/lib64/ -lm
>>>>>>>>>>> -I/opt/apps/intel19/mvapich2-gdr/2.3.7/include
>>>>>>>>>>> -L/opt/apps/intel19/mvapich2-gdr/2.3.7/lib64 -Wl,-rpath
>>>>>>>>>>> -Wl,/opt/apps/intel19/mvapich2-gdr/2.3.7/lib64 -Wl,--enable-new-dtags -lmpi
>>>>>>>>>>>
>>>>>>>>>>>     Checking for program /opt/apps/cuda/12.0/bin/nvcc...found
>>>>>>>>>>>
>>>>>>>>>>> It looks like you are mixing two different CUDA versions in this
>>>>>>>>>>> build.
>>>>>>>>>>>
>>>>>>>>>>> Perhaps you need to use cuda-11.4 with this install of
>>>>>>>>>>> mvapich2-gdr.
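>>>>>>>>>>>
>>>>>>>>>>> A quick way to spot this kind of mismatch (just a sketch):
>>>>>>>>>>>
>>>>>>>>>>>     mpicc -show | grep -o '/opt/apps/cuda/[^ ]*' | sort -u
>>>>>>>>>>>     which nvcc && nvcc --version
>>>>>>>>>>>
>>>>>>>>>>> The first shows the CUDA the MPI wrappers were built against, the
>>>>>>>>>>> second the CUDA toolkit configure will pick up; load the cuda
>>>>>>>>>>> module that makes them agree before configuring.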
>>>>>>>>>>>
>>>>>>>>>>> Satish
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 8 Dec 2023, Matthew Knepley wrote:
>>>>>>>>>>>
>>>>>>>>>>> > On Fri, Dec 8, 2023 at 1:54 PM Sreeram R Venkat <
>>>>>>>>>>> srvenkat at utexas.edu> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > > I am trying to build PETSc with CUDA using the CUDA-Aware
>>>>>>>>>>> MVAPICH2-GDR.
>>>>>>>>>>> > >
>>>>>>>>>>> > > Here is my configure command:
>>>>>>>>>>> > >
>>>>>>>>>>> > > ./configure PETSC_ARCH=linux-c-debug-mvapich2-gdr
>>>>>>>>>>> --download-hypre
>>>>>>>>>>> > >  --with-cuda=true --cuda-dir=$TACC_CUDA_DIR --with-hdf5=true
>>>>>>>>>>> > > --with-hdf5-dir=$TACC_PHDF5_DIR --download-elemental
>>>>>>>>>>> --download-metis
>>>>>>>>>>> > > --download-parmetis --with-cc=mpicc --with-cxx=mpicxx
>>>>>>>>>>> --with-fc=mpif90
>>>>>>>>>>> > >
>>>>>>>>>>> > > which errors with:
>>>>>>>>>>> > >
>>>>>>>>>>> > >           UNABLE to CONFIGURE with GIVEN OPTIONS (see
>>>>>>>>>>> configure.log for
>>>>>>>>>>> > > details):
>>>>>>>>>>> > >
>>>>>>>>>>> > >
>>>>>>>>>>> ---------------------------------------------------------------------------------------------
>>>>>>>>>>> > >   CUDA compile failed with arch flags " -ccbin mpic++
>>>>>>>>>>> -std=c++14
>>>>>>>>>>> > > -Xcompiler -fPIC
>>>>>>>>>>> > >   -Xcompiler -fvisibility=hidden -g -lineinfo -gencode
>>>>>>>>>>> > > arch=compute_80,code=sm_80"
>>>>>>>>>>> > >   generated from "--with-cuda-arch=80"
>>>>>>>>>>> > >
>>>>>>>>>>> > >
>>>>>>>>>>> > >
>>>>>>>>>>> > > The same configure command works when I use the Intel MPI
>>>>>>>>>>> and I can build
>>>>>>>>>>> > > with CUDA. The full config.log file is attached. Please let
>>>>>>>>>>> me know if you
>>>>>>>>>>> > > need any other information. I appreciate your help with this.
>>>>>>>>>>> > >
>>>>>>>>>>> >
>>>>>>>>>>> > The proximate error is
>>>>>>>>>>> >
>>>>>>>>>>> > Executing: nvcc -c -o
>>>>>>>>>>> /tmp/petsc-kn3f29gl/config.packages.cuda/conftest.o
>>>>>>>>>>> > -I/tmp/petsc-kn3f29gl/config.setCompilers
>>>>>>>>>>> > -I/tmp/petsc-kn3f29gl/config.types
>>>>>>>>>>> > -I/tmp/petsc-kn3f29gl/config.packages.cuda  -ccbin mpic++
>>>>>>>>>>> -std=c++14
>>>>>>>>>>> > -Xcompiler -fPIC -Xcompiler -fvisibility=hidden -g -lineinfo
>>>>>>>>>>> -gencode
>>>>>>>>>>> > arch=compute_80,code=sm_80
>>>>>>>>>>> /tmp/petsc-kn3f29gl/config.packages.cuda/conftest.cu
>>>>>>>>>>> > stdout:
>>>>>>>>>>> > /opt/apps/cuda/11.4/include/crt/sm_80_rt.hpp(141): error: more
>>>>>>>>>>> than one
>>>>>>>>>>> > instance of overloaded function
>>>>>>>>>>> "__nv_associate_access_property_impl" has
>>>>>>>>>>> > "C" linkage
>>>>>>>>>>> > 1 error detected in the compilation of
>>>>>>>>>>> > "/tmp/petsc-kn3f29gl/config.packages.cuda/conftest.cu".
>>>>>>>>>>> > Possible ERROR while running compiler: exit code 1
>>>>>>>>>>> > stderr:
>>>>>>>>>>> > /opt/apps/cuda/11.4/include/crt/sm_80_rt.hpp(141): error: more
>>>>>>>>>>> than one
>>>>>>>>>>> > instance of overloaded function
>>>>>>>>>>> "__nv_associate_access_property_impl" has
>>>>>>>>>>> > "C" linkage
>>>>>>>>>>> >
>>>>>>>>>>> > 1 error detected in the compilation of
>>>>>>>>>>> > "/tmp/petsc-kn3f29gl/config.packages.cuda
>>>>>>>>>>> >
>>>>>>>>>>> > This looks like screwed up headers to me, but I will let
>>>>>>>>>>> someone that
>>>>>>>>>>> > understands CUDA compilation reply.
>>>>>>>>>>> >
>>>>>>>>>>> >   Thanks,
>>>>>>>>>>> >
>>>>>>>>>>> >      Matt
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks,
>>>>>>>>>>> > > Sreeram
>>>>>>>>>>> > >
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>>