[petsc-users] VecNorm causes program to hang

Sreeram R Venkat srvenkat at utexas.edu
Fri Nov 17 12:39:53 CST 2023


Thank you; that fixed the problem. I added an else branch so that the
remaining ranks also make the call:

else
{
    PetscCall(VecCUDAReplaceArray(v, NULL));
}

Thanks,
Sreeram


On Fri, Nov 17, 2023 at 12:09 PM Barry Smith <bsmith at petsc.dev> wrote:

>
>    So the "bug" is not as ginormous as I originally thought. It will never
> produce incorrect results but can result in the errors you received.
>
>    The problem is
>
> if (row_rank == 0)
> {
>     PetscCall(VecCUDAReplaceArray(v, d_a));
> }
>
> The place/replace array routines are actually collective and need to be
> called by every MPI process in the vector's communicator, regardless of
> local size. This is because the call can invalidate the previously known
> norm values that have been cached in the vector. If the norm values are
> invalidated on some MPI processes but not others, you will get the error
> you have seen.
>
>   Barry
>
>   I will prepare a branch with better documentation and clearer error
> handling for this situation.
>
>
>
>
> On Nov 16, 2023, at 6:30 PM, Barry Smith <bsmith at petsc.dev> wrote:
>
>
>   Congratulations, you have found a ginormous bug in PETSc! Thanks for the
> detailed information on the problem.
>
>    I will post a fix shortly.
>
>    Barry
>
>
> On Nov 16, 2023, at 6:19 PM, Sreeram R Venkat <srvenkat at utexas.edu> wrote:
>
> I have a program which reads a vector from file into an array, and then
> uses that array to create a PETSc Vec object. The Vec is defined on the
> global communicator, but not all processes actually contain entries of it.
> For example, suppose we have 4 processors, and the vector is of size 10.
> Rank 0 will contain entries 0-4 and Rank 1 will contain entries 5-9. Ranks
> 2 and 3 will not have any entries of the Vec.
>
> This Vec is then used as an input to other parts of the code, and those
> work fine. However, if I try to take the norm of the Vec with VecNorm(), I
> get the error
>
> `MPI_Allreduce() called in different locations (code lines) on different
> processors`
>
> The stack trace shows that ranks 0 and 1 (from the above example) are
> still in the VecNorm() function while ranks 2 and 3 have moved on to a
> later part of the code. If I add a PetscBarrier() after the VecNorm(), I
> find that the program hangs.
>
> The funny thing is that part of the code duplicates the Vec with
> VecDuplicate() and assigns to the duplicated vector the result of some
> computations. The duplicated Vec has the same layout as the original Vec,
> but taking VecNorm() on the duplicated Vec works fine. If I use VecCopy(),
> however, the copied Vec also causes VecNorm() to hang. I've printed out the
> original Vec, and there are no corrupted/NaN entries.
>
> I have a temporary workaround where I perturb the original Vec slightly
> before copying it to another Vec. This causes the program to successfully
> terminate.
>
> Any advice on how to get VecNorm() working with the original Vec?
>
> Thanks,
> Sreeram
>
>
>
>
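The mechanism Barry describes (a cached norm invalidated on some ranks but
not others) can be sketched with a small toy model in plain C. This is not
PETSc code and does not reflect PETSc's actual internals: ToyVec and every
toy_* function are hypothetical names invented for illustration, there is no
MPI involved, and the model simply assumes a norm was already cached on all
ranks before the array was replaced. A rank with a valid cache returns
immediately, while a rank with an invalid cache enters the global reduction,
so the norm call is only safe when either all ranks or no ranks reduce.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NRANKS 4 /* matches the 4-process example in the thread */

/* Hypothetical per-rank view of a vector; only the norm cache flag is modeled. */
typedef struct {
    bool norm_cached; /* true if this rank believes a valid norm is stored */
} ToyVec;

/* Assumption of the model: an earlier step cached the norm on every rank. */
void toy_cache_norm_everywhere(ToyVec ranks[], int n)
{
    assert(n > 0);
    for (int r = 0; r < n; ++r) ranks[r].norm_cached = true;
}

/* Stand-in for a replace-array call on ONE rank: new data, so the cached
   norm on that rank is no longer valid. */
void toy_replace_array(ToyVec *v)
{
    v->norm_cached = false;
}

/* Number of ranks that would enter the global reduction during a norm call
   (ranks with a valid cache would return the stored value instead). */
int toy_ranks_entering_reduction(const ToyVec ranks[], int n)
{
    int count = 0;
    for (int r = 0; r < n; ++r)
        if (!ranks[r].norm_cached) ++count;
    return count;
}

/* The norm call is consistent only if all ranks reduce or none do. */
bool toy_norm_is_consistent(const ToyVec ranks[], int n)
{
    int count = toy_ranks_entering_reduction(ranks, n);
    return count == 0 || count == n;
}
```

In this model, calling toy_replace_array() only on ranks 0 and 1 leaves two of
the four ranks entering the reduction, which corresponds to the mismatched
MPI_Allreduce() reported in the thread. Calling it on all four ranks, as the
else branch in the fix does, makes every rank take the same path.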