[petsc-users] VecNorm causes program to hang

Thu Nov 16 20:41:38 CST 2023

Actually, here's a short test case I just made.
I have it on a git repo: https://github.com/s769/petsc-test

I included instructions for building and running it, but if there are any
issues, please let me know.

In this small test code, I noticed some CUDA memory errors at the VecAXPY()
call when the proc_cols variable is not 1. I am still trying to figure out
what is causing that, but in the meantime, the code in the repo hangs for
proc_rows=3, proc_cols=1, n=10 when it calls VecNorm() on the Vec.
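
For anyone reading without the repo handy, here is a rough plain-C sketch
(no MPI or PETSc, not taken from the repo) of the layout from my original
message quoted below: a vector of size 10 on 4 ranks where only ranks 0 and 1
own entries. The helper names local_size, local_sum_sq and
simulated_norm2_squared are made up for illustration, and the even split over
the owning ranks is an assumption. The loop over ranks stands in for the
reduction inside a collective norm: ranks that own nothing still contribute
(a zero), which is why every rank on the communicator has to reach the
VecNorm() call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: entries owned by `rank` when a vector of global
 * size n is split evenly over the first `owners` ranks only. */
static size_t local_size(size_t n, int owners, int rank)
{
    assert(owners > 0 && rank >= 0);
    if (rank >= owners)
        return 0;                      /* e.g. ranks 2 and 3 own nothing */
    size_t base = n / (size_t)owners;
    size_t rem  = n % (size_t)owners;
    return base + ((size_t)rank < rem ? 1 : 0);
}

/* Partial sum of squares over one rank's local entries.
 * Returns 0.0 when the rank owns no entries. */
static double local_sum_sq(const double *x, size_t n_local)
{
    double s = 0.0;
    for (size_t i = 0; i < n_local; ++i)
        s += x[i] * x[i];
    return s;
}

/* Stand-in for the all-reduce: loops over every rank, including the empty
 * ones, and adds up the partial sums. The 2-norm is the square root of the
 * returned value. */
static double simulated_norm2_squared(const double *x, size_t n,
                                      int nranks, int owners)
{
    double total = 0.0;
    size_t offset = 0;
    for (int r = 0; r < nranks; ++r) {
        size_t nl = local_size(n, owners, r);
        total += local_sum_sq(x + offset, nl);
        offset += nl;
    }
    assert(offset == n);               /* every entry is owned exactly once */
    return total;
}
```

Calling simulated_norm2_squared(x, 10, 4, 2) walks all four ranks even though
only two hold data. In a real MPI run, the equivalent reduction cannot finish
if any rank skips it.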

Hope this helps.

Thanks,
Sreeram

On Thu, Nov 16, 2023 at 8:38 PM Sreeram R Venkat <srvenkat at utexas.edu>
wrote:

> Ok, will do. It may take me a few days to get a minimal reproducible
> example though since the rest of the program has gotten quite large.
>
> Thanks,
> Sreeram
>
> On Thu, Nov 16, 2023 at 8:27 PM Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Thu, Nov 16, 2023 at 6:19 PM Sreeram R Venkat <srvenkat at utexas.edu>
>> wrote:
>>
>>> I have a program which reads a vector from file into an array, and then
>>> uses that array to create a PETSc Vec object. The Vec is defined on the
>>> global communicator, but not all processes actually contain entries of it.
>>> For example, suppose we have 4 processors, and the vector is of size 10.
>>> Rank 0 will contain entries 0-4 and Rank 1 will contain entries 5-9. Ranks
>>> 2 and 3 will not have any entries of the Vec.
>>>
>>> This Vec is then used as an input to other parts of the code, and those
>>> work fine. However, if I try to take the norm of the Vec with VecNorm(), I
>>> get the error
>>>
>>> `MPI_Allreduce() called in different locations (code lines) on different
>>> processors`
>>>
>>> The stack trace shows that ranks 0 and 1 (from the above example) are
>>> still in the VecNorm() function while ranks 2 and 3 have moved on to a
>>> later part of the code. If I add a PetscBarrier() after the VecNorm(), I
>>> find that the program hangs.
>>>
>>> The funny thing is that part of the code duplicates the Vec with
>>> VecDuplicate() and assigns to the duplicated vector the result of some
>>> computations. The duplicated Vec has the same layout as the original Vec,
>>> but taking VecNorm() on the duplicated Vec works fine. If I use VecCopy(),
>>> however, the copied Vec also causes VecNorm() to hang. I've printed out the
>>> original Vec, and there are no corrupted/NaN entries.
>>>
>>> I have a temporary workaround where I perturb the original Vec slightly
>>> before copying it to another Vec. With that change, the program
>>> terminates successfully.
>>>
>>> Any advice on how to get VecNorm() working with the original Vec?
>>>
>>
>> Vecs with empty layouts work fine, so it must be something else about how
>> it is created.
>>
>> In order to track it down, I would first make a short program that just
>> creates the Vec as you say and see if it hangs. If so, just send it and we
>> will debug it. If not, I would systematically cut down your program until
>> you get something that hangs that you can send to us.
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>> Thanks,
>>> Sreeram
>>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>>
>