[petsc-users] VecNorm causes program to hang

Mark Adams mfadams at lbl.gov
Fri Nov 17 08:32:41 CST 2023


I get this error when I build the test case on Perlmutter:

(base) 06:30 2 login10 master= perlmutter:~/petsc-test$ bash -x buildme.sh
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output (/opt/cray/pe/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ git pull
Already up to date.
+ cmake .
-- Configuring done
-- Generating done
-- Build files have been written to: /global/homes/m/madams/petsc-test
+ make -j
[ 33%] Building CUDA object CMakeFiles/test.dir/main.cu.o
In file included from /global/homes/m/madams/petsc/include/petscbag.h:3,
                 from /global/homes/m/madams/petsc/include/petsc.h:6,
                 from /global/homes/m/madams/petsc-test/shared.cuh:8,
                 from /global/homes/m/madams/petsc-test/main.cu:1:
/global/homes/m/madams/petsc/include/petscsys.h:65:12: fatal error: mpi.h: No such file or directory
   65 |   #include <mpi.h>
      |            ^~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/test.dir/build.make:76: CMakeFiles/test.dir/main.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/test.dir/all] Error 2
make: *** [Makefile:91: all] Error 2
(base) 06:31 2 login10 master= perlmutter:~/petsc-test$


On Thu, Nov 16, 2023 at 9:42 PM Sreeram R Venkat <srvenkat at utexas.edu>
wrote:

> Actually, here's a short test case I just made.
> I have it on a git repo: https://github.com/s769/petsc-test
>
> I put some instructions for how to build and run, but if there are issues,
> please let me know.
>
> In this small test code, I noticed that there are some CUDA memory errors
> in the VecAXPY() line if the proc_cols variable is not 1. Still trying to
> figure out what might be causing that, but in the meantime, the code I have
> up there hangs for proc_rows=3, proc_cols=1, n=10 when we try to get the
> norm of the Vec.
>
> Hope this helps.
>
> Thanks,
> Sreeram
>
> On Thu, Nov 16, 2023 at 8:38 PM Sreeram R Venkat <srvenkat at utexas.edu>
> wrote:
>
>> Ok, will do. It may take me a few days to get a minimal reproducible
>> example though since the rest of the program has gotten quite large.
>>
>> Thanks,
>> Sreeram
>>
>> On Thu, Nov 16, 2023 at 8:27 PM Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Thu, Nov 16, 2023 at 6:19 PM Sreeram R Venkat <srvenkat at utexas.edu>
>>> wrote:
>>>
>>>> I have a program which reads a vector from file into an array, and then
>>>> uses that array to create a PETSc Vec object. The Vec is defined on the
>>>> global communicator, but not all processes actually contain entries of it.
>>>> For example, suppose we have 4 processors, and the vector is of size 10.
>>>> Rank 0 will contain entries 0-4 and Rank 1 will contain entries 5-9. Ranks
>>>> 2 and 3 will not have any entries of the Vec.
>>>>
>>>> This Vec is then used as an input to other parts of the code, and those
>>>> work fine. However, if I try to take the norm of the Vec with VecNorm(), I
>>>> get the error
>>>>
>>>> `MPI_Allreduce() called in different locations (code lines) on
>>>> different processors`
>>>>
>>>> The stack trace shows that ranks 0 and 1 (from the above example) are
>>>> still in the VecNorm() function while ranks 2 and 3 have moved on to a
>>>> later part of the code. If I add a PetscBarrier() after the VecNorm(), I
>>>> find that the program hangs.
>>>>
>>>> The funny thing is that part of the code duplicates the Vec with
>>>> VecDuplicate() and assigns to the duplicated vector the result of some
>>>> computations. The duplicated Vec has the same layout as the original Vec,
>>>> but taking VecNorm() on the duplicated Vec works fine. If I use VecCopy(),
>>>> however, the copied Vec also causes VecNorm() to hang. I've printed out the
>>>> original Vec, and there are no corrupted/NaN entries.
>>>>
>>>> I have a temporary workaround where I perturb the original Vec slightly
>>>> before copying it to another Vec. With this change, the program
>>>> terminates successfully.
>>>>
>>>> Any advice on how to get VecNorm() working with the original Vec?
>>>>
>>>
>>> Vecs with empty layouts work fine, so it must be something else about
>>> how it is created.
>>>
>>> In order to track it down, I would first make a short program that just
>>> creates the Vec as you say and see if it hangs. If so, just send it and we
>>> will debug it. If not, I would systematically cut down your program until
>>> you get something that hangs that you can send to us.
>>>
>>>   Thanks,
>>>
>>>      Matt
>>>
>>>
>>>> Thanks,
>>>> Sreeram
>>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>>
>>