[petsc-dev] error with karlrupp/fix-cuda-streams

Mark Adams mfadams at lbl.gov
Sat Sep 28 13:29:32 CDT 2019


On Sat, Sep 28, 2019 at 12:55 AM Karl Rupp <rupp at iue.tuwien.ac.at> wrote:

> Hi Mark,
>
> > OK, so now the problem has shifted somewhat in that it now manifests
> > itself on small cases.


It is somewhat random and anecdotal, but it does now happen on the smaller
test problem. When I try to narrow down where the problem manifests by
reducing the number of GPUs/processes, the problem cannot be too small
(i.e., the bug does not manifest on even smaller problems).

But it is much more stable now, and there seems to be only this one
remaining problem, in the matrix transpose multiply. You have made a lot of
progress.


> > In an earlier investigation I was drawn to
> > MatTranspose but had a hard time pinning it down. The bug seems more
> > stable now; you probably fixed what look like all the other bugs.
> >
> > I added print statements with norms of vectors in mg.c (v-cycle) and
> > found that the diffs between the CPU and GPU runs came in MatRestrict,
> > which calls MatMultTranspose. I added identical print statements in the
> > two versions of MatMultTranspose and see this. (pinning to the CPU does
> > not seem to make any difference). Note that the problem comes in the 2nd
> > iteration where the *output* vector is non-zero coming in (this should
> > not matter).
> >
> > Karl, I zeroed out the output vector (yy) on entry to this method and
> > it fixed the problem. This is with -n 4; it always works with
> > -n 3. See the attached process layouts. It looks like this happens when
> > you use the 2nd socket.
> >
> > So this looks like an Nvidia bug. Let me know what you think and I can
> > pass it on to ORNL.
>
> Hmm, there were some issues with MatMultTranspose_MPIAIJ at some point.
> I've addressed some of them, but I can't confidently say that all of the
> issues were fixed. Thus, I don't think it's a problem in NVIDIA's
> cuSparse, but rather something we need to fix in PETSc. Note that the
> problem shows up with multiple MPI ranks;


It seems to need two sockets. My current test works with 1, 2, and 3
GPUs (one socket) but fails with 4, when you go to the second socket.


> if it were a problem in
> cuSparse, it would show up on a single rank as well.
>

What I am seeing is consistent with cuSPARSE having a race condition in
zeroing out the output vector in some way, but I don't know.


>
> Best regards,
> Karli
>
>
>
>
>
> > 06:49  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 4 -a 4 -c 4 -g 1
> > ./ex56 -cells 8,12,16 -ex56_dm_vec_type cuda -ex56_dm_mat_type
> > aijcusparse
> > [0] 3465 global equations, 1155 vertices
> > [0] 3465 equations in vector, 1155 vertices
> >    0 SNES Function norm 1.725526579328e+01
> >      0 KSP Residual norm 1.725526579328e+01
> >          2) call Restrict with |r| = 1.402719214830704e+01
> >                          MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 1.40271921483070e+01
> >                          MatMultTranspose_MPIAIJ |y in| =
> > 0.00000000000000e+00
> >                          MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00000000000000e+00
> >                          *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 3.43436359545813e+00
> >                          MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 1.29055494844681e+01
> >                  3) |R| = 1.290554948446808e+01
> >          2) call Restrict with |r| = 4.109771717986951e+00
> >                          MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 4.10977171798695e+00
> >                          MatMultTranspose_MPIAIJ |y in| =
> > 0.00000000000000e+00
> >                          MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00000000000000e+00
> >                          *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 1.79415048609144e-01
> >                          MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 9.01083013948788e-01
> >                  3) |R| = 9.010830139487883e-01
> >                  4) |X| = 2.864698671963022e+02
> >                  5) |x| = 9.763280000911783e+02
> >                  6) post smooth |x| = 8.940011621494751e+02
> >                  4) |X| = 8.940011621494751e+02
> >                  5) |x| = 1.005081556495388e+03
> >                  6) post smooth |x| = 1.029043994031627e+03
> >      1 KSP Residual norm 8.102614049404e+00
> >          2) call Restrict with |r| = 4.402603749876137e+00
> >                          MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 4.40260374987614e+00
> >                          MatMultTranspose_MPIAIJ |y in| =
> > 1.29055494844681e+01
> >                          MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00000000000000e+00
> >                          *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 1.68544559626318e+00
> >                          MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 1.82129824300863e+00
> >                  3) |R| = 1.821298243008628e+00
> >          2) call Restrict with |r| = 1.068309793900564e+00
> >                          MatMultTranspose_MPIAIJCUSPARSE |x in| =
> > 1.06830979390056e+00
> >                          MatMultTranspose_MPIAIJ |y in| =
> > 9.01083013948788e-01
> >                          MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
> > 0.00000000000000e+00
> >                          *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
> > 1.40519177065298e-01
> >                          MatMultTranspose_MPIAIJCUSPARSE final |yy| =
> > 1.01853904152812e-01
> >                  3) |R| = 1.018539041528117e-01
> >                  4) |X| = 4.949616392884510e+01
> >                  5) |x| = 9.309440014159884e+01
> >                  6) post smooth |x| = 5.432486021529479e+01
> >                  4) |X| = 5.432486021529479e+01
> >                  5) |x| = 8.246142532204632e+01
> >                  6) post smooth |x| = 7.605703654091440e+01
> >    Linear solve did not converge due to DIVERGED_ITS iterations 1
> > Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0
> > 06:50  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 4 -a 4 -c 4 -g 1
> > ./ex56 -cells 8,12,16
> > [0] 3465 global equations, 1155 vertices
> > [0] 3465 equations in vector, 1155 vertices
> >    0 SNES Function norm 1.725526579328e+01
> >      0 KSP Residual norm 1.725526579328e+01
> >          2) call Restrict with |r| = 1.402719214830704e+01
> >                          MatMultTranspose_MPIAIJ |x in| =
> > 1.40271921483070e+01
> >                          MatMultTranspose_MPIAIJ |y in| =
> > 0.00000000000000e+00
> >                          MatMultTranspose_MPIAIJ |a->lvec| =
> > 0.00000000000000e+00
> >                          *** MatMultTranspose_MPIAIJ |yy| =
> > 3.43436359545813e+00
> >                          MatMultTranspose_MPIAIJ final |yy| =
> > 1.29055494844681e+01
> >                  3) |R| = 1.290554948446809e+01
> >          2) call Restrict with |r| = 4.109771717986956e+00
> >                          MatMultTranspose_MPIAIJ |x in| =
> > 4.10977171798696e+00
> >                          MatMultTranspose_MPIAIJ |y in| =
> > 0.00000000000000e+00
> >                          MatMultTranspose_MPIAIJ |a->lvec| =
> > 0.00000000000000e+00
> >                          *** MatMultTranspose_MPIAIJ |yy| =
> > 1.79415048609143e-01
> >                          MatMultTranspose_MPIAIJ final |yy| =
> > 9.01083013948789e-01
> >                  3) |R| = 9.010830139487889e-01
> >                  4) |X| = 2.864698671963023e+02
> >                  5) |x| = 9.763280000911785e+02
> >                  6) post smooth |x| = 8.940011621494754e+02
> >                  4) |X| = 8.940011621494754e+02
> >                  5) |x| = 1.005081556495388e+03
> >                  6) post smooth |x| = 1.029043994031627e+03
> >      1 KSP Residual norm 8.102614049404e+00
> >          2) call Restrict with |r| = 4.402603749876139e+00
> >                          MatMultTranspose_MPIAIJ |x in| =
> > 4.40260374987614e+00
> >                          MatMultTranspose_MPIAIJ |y in| =
> > 1.29055494844681e+01
> >                          MatMultTranspose_MPIAIJ |a->lvec| =
> > 0.00000000000000e+00
> >                          *** MatMultTranspose_MPIAIJ |yy| =
> > 4.43650979822523e-01
> >                          MatMultTranspose_MPIAIJ final |yy| =
> > 1.18089369006243e+00
> >                  3) |R| = 1.180893690062426e+00
> >          2) call Restrict with |r| = 6.868764720156294e-01
> >                          MatMultTranspose_MPIAIJ |x in| =
> > 6.86876472015629e-01
> >                          MatMultTranspose_MPIAIJ |y in| =
> > 9.01083013948789e-01
> >                          MatMultTranspose_MPIAIJ |a->lvec| =
> > 0.00000000000000e+00
> >                          *** MatMultTranspose_MPIAIJ |yy| =
> > 3.36768099045088e-02
> >                          MatMultTranspose_MPIAIJ final |yy| =
> > 6.40334376876017e-02
> >                  3) |R| = 6.403343768760170e-02
> >                  4) |X| = 2.380471873599142e+01
> >                  5) |x| = 6.932703848368443e+01
> >                  6) post smooth |x| = 4.502536862656444e+01
> >                  4) |X| = 4.502536862656444e+01
> >                  5) |x| = 7.998534854728734e+01
> >                  6) post smooth |x| = 7.660075651381680e+01
> >    Linear solve did not converge due to DIVERGED_ITS iterations 1
> > Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0
> > 06:50  /gpfs/alpine/geo127/scratch/adams$
> >
>


More information about the petsc-dev mailing list