[petsc-users] VecNorm causes program to hang
Barry Smith
bsmith at petsc.dev
Fri Nov 17 11:09:37 CST 2023
So the "bug" is not as ginormous as I originally thought. It will never produce incorrect results but can result in the errors you received.
The problem is
    if (row_rank == 0)
    {
        PetscCall(VecCUDAReplaceArray(v, d_a));
    }
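The PlaceArray/ReplaceArray routines (such as VecCUDAReplaceArray() above) are actually collective and need to be called by all MPI processes that own a vector, regardless of the local size. This is because the call can invalidate the previously known norm values that have been cached in the vector. If the norm values are invalidated on some MPI processes but not others, the processes with invalidated values enter the MPI_Allreduce() inside VecNorm() while the others return their cached value and move on, and you will get the error you have seen.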
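
To make the failure mode concrete, here is a toy model in plain C. It is not PETSc code and does not reflect PETSc's internal data structures: every name in it (ToyRankState, toy_replace_array, toy_norm_is_safe, and so on) is hypothetical, and it assumes a norm had already been cached on every rank before the array was replaced. Each "rank" only tracks whether it believes its cached norm is still valid, and a norm computation is treated as safe only if either every rank or no rank would enter the global reduction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NRANKS 4

/* Hypothetical per-rank state: is the cached norm still considered valid? */
typedef struct {
    bool norm_cached;
} ToyRankState;

/* Models a replace-array call on one rank: new data invalidates the cache. */
void toy_replace_array(ToyRankState *rank) {
    assert(rank != NULL);
    rank->norm_cached = false;
}

/* Counts the ranks that would enter the global reduction, i.e. the ranks
   whose cached norm is no longer valid. */
int toy_ranks_entering_reduction(const ToyRankState *ranks, int n) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (!ranks[i].norm_cached) count++;
    }
    return count;
}

/* Runs one scenario with 4 ranks, where only ranks 0 and 1 own entries.
   If call_on_all_ranks is false, only the ranks that own entries replace
   the array, as in the guarded call above. Returns true when the ranks
   agree on whether to enter the reduction (all of them, or none). */
bool toy_norm_is_safe(bool call_on_all_ranks) {
    ToyRankState ranks[TOY_NRANKS];

    /* Assumption: a norm was cached on every rank earlier. */
    for (int i = 0; i < TOY_NRANKS; i++) ranks[i].norm_cached = true;

    for (int i = 0; i < TOY_NRANKS; i++) {
        bool owns_entries = (i < 2);
        if (call_on_all_ranks || owns_entries) toy_replace_array(&ranks[i]);
    }

    int entering = toy_ranks_entering_reduction(ranks, TOY_NRANKS);
    return entering == 0 || entering == TOY_NRANKS;
}
```

In this model, toy_norm_is_safe(false) reports a mismatch because two ranks would wait in the reduction while the other two skip it, and toy_norm_is_safe(true) reports agreement. For the real code, the corresponding change is to have every process that shares the vector make the VecCUDAReplaceArray() call, including those whose local size is zero.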
Barry
I will prepare a branch with better documentation and clearer error handling for this situation.
> On Nov 16, 2023, at 6:30 PM, Barry Smith <bsmith at petsc.dev> wrote:
>
>
> Congratulations, you have found a ginormous bug in PETSc! Thanks for the detailed information on the problem.
>
> I will post a fix shortly.
>
> Barry
>
>
>> On Nov 16, 2023, at 6:19 PM, Sreeram R Venkat <srvenkat at utexas.edu> wrote:
>>
>> I have a program which reads a vector from file into an array, and then uses that array to create a PETSc Vec object. The Vec is defined on the global communicator, but not all processes actually contain entries of it. For example, suppose we have 4 processors, and the vector is of size 10. Rank 0 will contain entries 0-4 and Rank 1 will contain entries 5-9. Ranks 2 and 3 will not have any entries of the Vec.
>>
>> This Vec is then used as an input to other parts of the code, and those work fine. However, if I try to take the norm of the Vec with VecNorm(), I get the error
>>
>> `MPI_Allreduce() called in different locations (code lines) on different processors`
>>
>> The stack trace shows that ranks 0 and 1 (from the above example) are still in the VecNorm() function while ranks 2 and 3 have moved on to a later part of the code. If I add a PetscBarrier() after the VecNorm(), I find that the program hangs.
>>
>> The funny thing is that part of the code duplicates the Vec with VecDuplicate() and assigns to the duplicated vector the result of some computations. The duplicated Vec has the same layout as the original Vec, but taking VecNorm() on the duplicated Vec works fine. If I use VecCopy(), however, the copied Vec also causes VecNorm() to hang. I've printed out the original Vec, and there are no corrupted/NaN entries.
>>
>> I have a temporary workaround where I perturb the original Vec slightly before copying it to another Vec. This causes the program to successfully terminate.
>>
>> Any advice on how to get VecNorm() working with the original Vec?
>>
>> Thanks,
>> Sreeram
>