[petsc-users] VecNorm causes program to hang

Sreeram R Venkat srvenkat at utexas.edu
Fri Nov 17 11:05:10 CST 2023


I've updated the buildme script to specify the MPI and CUDA compilers.
Please make sure those modules are loaded, and let me know if it works.

Thanks,
Sreeram

On Fri, Nov 17, 2023 at 9:32 AM Mark Adams <mfadams at lbl.gov> wrote:

> I get this error:
>
> (base) 06:30 2 login10 master= perlmutter:~/petsc-test$ bash -x buildme.sh
> + '[' -z '' ']'
> + case "$-" in
> + __lmod_vx=x
> + '[' -n x ']'
> + set +x
> Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output (/opt/cray/pe/lmod/lmod/init/bash)
> Shell debugging restarted
> + unset __lmod_vx
> + git pull
> Already up to date.
> + cmake .
> -- Configuring done
> -- Generating done
> -- Build files have been written to: /global/homes/m/madams/petsc-test
> + make -j
> [ 33%] Building CUDA object CMakeFiles/test.dir/main.cu.o
> In file included from /global/homes/m/madams/petsc/include/petscbag.h:3,
>                  from /global/homes/m/madams/petsc/include/petsc.h:6,
>                  from /global/homes/m/madams/petsc-test/shared.cuh:8,
>                  from /global/homes/m/madams/petsc-test/main.cu:1:
> /global/homes/m/madams/petsc/include/petscsys.h:65:12: fatal error: mpi.h: No such file or directory
>    65 |   #include <mpi.h>
>       |            ^~~~~~~
> compilation terminated.
> make[2]: *** [CMakeFiles/test.dir/build.make:76: CMakeFiles/test.dir/main.cu.o] Error 1
> make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/test.dir/all] Error 2
> make: *** [Makefile:91: all] Error 2
> (base) 06:31 2 login10 master= perlmutter:~/petsc-test$
>
>
> On Thu, Nov 16, 2023 at 9:42 PM Sreeram R Venkat <srvenkat at utexas.edu>
> wrote:
>
>> Actually, here's a short test case I just made.
>> I have it on a git repo: https://github.com/s769/petsc-test
>>
>> I put some instructions for how to build and run, but if there are
>> issues, please let me know.
>>
>> In this small test code, I noticed that there are some CUDA memory errors
>> in the VecAXPY() line if the proc_cols variable is not 1. Still trying to
>> figure out what might be causing that, but in the meantime, the code I have
>> up there hangs for proc_rows=3, proc_cols=1, n=10 when we try to get the
>> norm of the Vec.
>>
>> Hope this helps.
>>
>> Thanks,
>> Sreeram
>>
>> On Thu, Nov 16, 2023 at 8:38 PM Sreeram R Venkat <srvenkat at utexas.edu>
>> wrote:
>>
>>> Ok, will do. It may take me a few days to get a minimal reproducible
>>> example though since the rest of the program has gotten quite large.
>>>
>>> Thanks,
>>> Sreeram
>>>
>>> On Thu, Nov 16, 2023 at 8:27 PM Matthew Knepley <knepley at gmail.com>
>>> wrote:
>>>
>>>> On Thu, Nov 16, 2023 at 6:19 PM Sreeram R Venkat <srvenkat at utexas.edu>
>>>> wrote:
>>>>
>>>>> I have a program which reads a vector from file into an array, and
>>>>> then uses that array to create a PETSc Vec object. The Vec is defined on
>>>>> the global communicator, but not all processes actually contain entries of
>>>>> it. For example, suppose we have 4 processors, and the vector is of size
>>>>> 10. Rank 0 will contain entries 0-4 and Rank 1 will contain entries 5-9.
>>>>> Ranks 2 and 3 will not have any entries of the Vec.
>>>>>
>>>>> This Vec is then used as an input to other parts of the code, and
>>>>> those work fine. However, if I try to take the norm of the Vec with
>>>>> VecNorm(), I get the error
>>>>>
>>>>> `MPI_Allreduce() called in different locations (code lines) on
>>>>> different processors`
>>>>>
>>>>> The stack trace shows that ranks 0 and 1 (from the above example) are
>>>>> still in the VecNorm() function while ranks 2 and 3 have moved on to a
>>>>> later part of the code. If I add a PetscBarrier() after the VecNorm(), I
>>>>> find that the program hangs.
>>>>>
>>>>> The funny thing is that part of the code duplicates the Vec with
>>>>> VecDuplicate() and assigns to the duplicated vector the result of some
>>>>> computations. The duplicated Vec has the same layout as the original Vec,
>>>>> but taking VecNorm() on the duplicated Vec works fine. If I use VecCopy(),
>>>>> however, the copied Vec also causes VecNorm() to hang. I've printed out the
>>>>> original Vec, and there are no corrupted/NaN entries.
>>>>>
>>>>> I have a temporary workaround where I perturb the original Vec
>>>>> slightly before copying it to another Vec. This causes the program to
>>>>> successfully terminate.
>>>>>
>>>>> Any advice on how to get VecNorm() working with the original Vec?
>>>>>
>>>>
>>>> Vecs with empty layouts work fine, so it must be something else about
>>>> how it is created.
>>>>
>>>> In order to track it down, I would first make a short program that just
>>>> creates the Vec as you say and see if it hangs. If so, just send it and we
>>>> will debug it. If not, I would systematically cut down your program until
>>>> you get something that hangs that you can send to us.
>>>>
>>>>   Thanks,
>>>>
>>>>      Matt
>>>>
>>>>
>>>>> Thanks,
>>>>> Sreeram
>>>>>
>>>>
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their
>>>> experiments is infinitely more interesting than any results to which their
>>>> experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://www.cse.buffalo.edu/~knepley/
>>>>
>>>
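
For reference, the layout described in the original report (4 ranks, a Vec
of size 10, entries 0-4 on rank 0, entries 5-9 on rank 1, nothing on ranks 2
and 3) can be sketched in plain C as below. This is only an illustration
under stated assumptions: the entries are assumed to be split in even blocks
over the first `owners` ranks, and `OwnedRange`, `owned_range`, `local_size`,
and `print_layout` are hypothetical names, not PETSc functions. The sketch
shows that ranks 2 and 3 still have a well-defined (empty) half-open range
and a local size of 0, so they are still part of the Vec's communicator and
still have to reach collective calls such as VecNorm().

```c
#include <assert.h>
#include <stdio.h>

/* Half-open range [start, end) of global indices owned by one rank. */
typedef struct {
    int start;
    int end;
} OwnedRange;

/*
 * Hypothetical helper (not part of PETSc): split n entries as evenly as
 * possible over the first `owners` ranks. Every rank >= owners gets an
 * empty range placed at the end of the vector.
 */
OwnedRange owned_range(int rank, int owners, int n)
{
    OwnedRange r;
    assert(owners > 0 && rank >= 0 && n >= 0);

    if (rank >= owners) {
        /* This rank holds no entries: empty range [n, n). */
        r.start = n;
        r.end = n;
        return r;
    }

    int base = n / owners;  /* entries every owning rank receives */
    int extra = n % owners; /* the first `extra` owners receive one more */

    r.start = rank * base + (rank < extra ? rank : extra);
    r.end = r.start + base + (rank < extra ? 1 : 0);
    return r;
}

/* Local size of the vector on `rank`; 0 for ranks that own nothing. */
int local_size(int rank, int owners, int n)
{
    OwnedRange r = owned_range(rank, owners, n);
    return r.end - r.start;
}

/* Print the range and local size of every rank in the communicator. */
void print_layout(int nranks, int owners, int n)
{
    for (int rank = 0; rank < nranks; rank++) {
        OwnedRange r = owned_range(rank, owners, n);
        printf("rank %d: [%d, %d), local size %d\n",
               rank, r.start, r.end, r.end - r.start);
    }
}
```

Calling print_layout(4, 2, 10) lists the ranges for the example from the
report. In a real test program the value from local_size() would be the
local length each rank passes when it creates the Vec, and every rank,
including the empty ones, would then call VecNorm() at the same point in
the code.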