[petsc-dev] error with karlrupp/fix-cuda-streams

Mark Adams mfadams at lbl.gov
Fri Sep 27 06:15:41 CDT 2019


On Thu, Sep 26, 2019 at 10:18 AM Balay, Satish <balay at mcs.anl.gov> wrote:

> Mark,
>
> The branch karlrupp/fix-cuda-streams is already merged to master. [and
> the branch is now deleted]
>

OK, so the problem has shifted somewhat: it now shows up on small cases. In
my earlier investigation I was drawn to MatTranspose but had a hard time
pinning it down. The bug seems more reproducible now, or perhaps you fixed
what look like all the other bugs.

I added print statements with vector norms in mg.c (the V-cycle) and found
that the differences between the CPU and GPU runs first appear in
MatRestrict, which calls MatMultTranspose. I then added identical print
statements to the two versions of MatMultTranspose; the output is pasted
below. (Pinning to the CPU does not seem to make any difference.) Note that
the problem appears in the 2nd iteration, where the *output* vector is
non-zero on entry (this should not matter).

Karl, I zeroed out the output vector (yy) on entry to this method and that
fixed the problem. This is with -n 4; it always works with -n 3. See the
attached process layouts. It looks like the failure appears when the 2nd
socket is used.
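
The workaround can be pictured with a toy model. This is only a sketch that
mimics the symptom, not a diagnosis of the real bug: suspect_mult_transpose
is a hypothetical stand-in that wrongly accumulates into its output, and
guarded_mult_transpose clears the output first, as I did with yy.

```python
def suspect_mult_transpose(a_rows, x, y):
    """Toy stand-in (hypothetical): adds A^T x to y instead of overwriting y."""
    for i, row in enumerate(a_rows):
        for j, a_ij in enumerate(row):
            y[j] += a_ij * x[i]
    return y

def guarded_mult_transpose(a_rows, x, y):
    """Workaround: clear y on entry so stale contents cannot leak into the result."""
    for j in range(len(y)):
        y[j] = 0.0
    return suspect_mult_transpose(a_rows, x, y)

a_rows = [[1.0, 2.0], [0.0, 3.0], [4.0, 0.0]]  # made-up dense 3x2 matrix
x = [1.0, 1.0, 1.0]

stale = suspect_mult_transpose(a_rows, x, [7.0, -9.0])
fixed = guarded_mult_transpose(a_rows, x, [7.0, -9.0])
print("without zeroing:", stale)
print("with zeroing:   ", fixed)
```

In this toy the stale values shift the result away from A^T x, and zeroing on
entry restores it, which matches the behavior I see but says nothing about
where the real fault lies.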

So this looks like an Nvidia bug. Let me know what you think and I can pass
it on to ORNL.

Thanks,
Mark

06:49  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 4 -a 4 -c 4 -g 1
./ex56 -cells 8,12,16 -ex56_dm_vec_type cuda -ex56_dm_mat_type aijcusparse
[0] 3465 global equations, 1155 vertices
[0] 3465 equations in vector, 1155 vertices
  0 SNES Function norm 1.725526579328e+01
    0 KSP Residual norm 1.725526579328e+01
        2) call Restrict with |r| = 1.402719214830704e+01
                        MatMultTranspose_MPIAIJCUSPARSE |x in| =
1.40271921483070e+01
                        MatMultTranspose_MPIAIJ |y in| =
0.00000000000000e+00
                        MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
0.00000000000000e+00
                        *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
3.43436359545813e+00
                        MatMultTranspose_MPIAIJCUSPARSE final |yy| =
1.29055494844681e+01
                3) |R| = 1.290554948446808e+01
        2) call Restrict with |r| = 4.109771717986951e+00
                        MatMultTranspose_MPIAIJCUSPARSE |x in| =
4.10977171798695e+00
                        MatMultTranspose_MPIAIJ |y in| =
0.00000000000000e+00
                        MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
0.00000000000000e+00
                        *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
1.79415048609144e-01
                        MatMultTranspose_MPIAIJCUSPARSE final |yy| =
9.01083013948788e-01
                3) |R| = 9.010830139487883e-01
                4) |X| = 2.864698671963022e+02
                5) |x| = 9.763280000911783e+02
                6) post smooth |x| = 8.940011621494751e+02
                4) |X| = 8.940011621494751e+02
                5) |x| = 1.005081556495388e+03
                6) post smooth |x| = 1.029043994031627e+03
    1 KSP Residual norm 8.102614049404e+00
        2) call Restrict with |r| = 4.402603749876137e+00
                        MatMultTranspose_MPIAIJCUSPARSE |x in| =
4.40260374987614e+00
                        MatMultTranspose_MPIAIJ |y in| =
1.29055494844681e+01
                        MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
0.00000000000000e+00
                        *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
1.68544559626318e+00
                        MatMultTranspose_MPIAIJCUSPARSE final |yy| =
1.82129824300863e+00
                3) |R| = 1.821298243008628e+00
        2) call Restrict with |r| = 1.068309793900564e+00
                        MatMultTranspose_MPIAIJCUSPARSE |x in| =
1.06830979390056e+00
                        MatMultTranspose_MPIAIJ |y in| =
9.01083013948788e-01
                        MatMultTranspose_MPIAIJCUSPARSE |a->lvec| =
0.00000000000000e+00
                        *** MatMultTranspose_MPIAIJCUSPARSE |yy| =
1.40519177065298e-01
                        MatMultTranspose_MPIAIJCUSPARSE final |yy| =
1.01853904152812e-01
                3) |R| = 1.018539041528117e-01
                4) |X| = 4.949616392884510e+01
                5) |x| = 9.309440014159884e+01
                6) post smooth |x| = 5.432486021529479e+01
                4) |X| = 5.432486021529479e+01
                5) |x| = 8.246142532204632e+01
                6) post smooth |x| = 7.605703654091440e+01
  Linear solve did not converge due to DIVERGED_ITS iterations 1
Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0
06:50  /gpfs/alpine/geo127/scratch/adams$ jsrun -n 4 -a 4 -c 4 -g 1 ./ex56
-cells 8,12,16
[0] 3465 global equations, 1155 vertices
[0] 3465 equations in vector, 1155 vertices
  0 SNES Function norm 1.725526579328e+01
    0 KSP Residual norm 1.725526579328e+01
        2) call Restrict with |r| = 1.402719214830704e+01
                        MatMultTranspose_MPIAIJ |x in| =
1.40271921483070e+01
                        MatMultTranspose_MPIAIJ |y in| =
0.00000000000000e+00
                        MatMultTranspose_MPIAIJ |a->lvec| =
0.00000000000000e+00
                        *** MatMultTranspose_MPIAIJ |yy| =
3.43436359545813e+00
                        MatMultTranspose_MPIAIJ final |yy| =
1.29055494844681e+01
                3) |R| = 1.290554948446809e+01
        2) call Restrict with |r| = 4.109771717986956e+00
                        MatMultTranspose_MPIAIJ |x in| =
4.10977171798696e+00
                        MatMultTranspose_MPIAIJ |y in| =
0.00000000000000e+00
                        MatMultTranspose_MPIAIJ |a->lvec| =
0.00000000000000e+00
                        *** MatMultTranspose_MPIAIJ |yy| =
1.79415048609143e-01
                        MatMultTranspose_MPIAIJ final |yy| =
9.01083013948789e-01
                3) |R| = 9.010830139487889e-01
                4) |X| = 2.864698671963023e+02
                5) |x| = 9.763280000911785e+02
                6) post smooth |x| = 8.940011621494754e+02
                4) |X| = 8.940011621494754e+02
                5) |x| = 1.005081556495388e+03
                6) post smooth |x| = 1.029043994031627e+03
    1 KSP Residual norm 8.102614049404e+00
        2) call Restrict with |r| = 4.402603749876139e+00
                        MatMultTranspose_MPIAIJ |x in| =
4.40260374987614e+00
                        MatMultTranspose_MPIAIJ |y in| =
1.29055494844681e+01
                        MatMultTranspose_MPIAIJ |a->lvec| =
0.00000000000000e+00
                        *** MatMultTranspose_MPIAIJ |yy| =
4.43650979822523e-01
                        MatMultTranspose_MPIAIJ final |yy| =
1.18089369006243e+00
                3) |R| = 1.180893690062426e+00
        2) call Restrict with |r| = 6.868764720156294e-01
                        MatMultTranspose_MPIAIJ |x in| =
6.86876472015629e-01
                        MatMultTranspose_MPIAIJ |y in| =
9.01083013948789e-01
                        MatMultTranspose_MPIAIJ |a->lvec| =
0.00000000000000e+00
                        *** MatMultTranspose_MPIAIJ |yy| =
3.36768099045088e-02
                        MatMultTranspose_MPIAIJ final |yy| =
6.40334376876017e-02
                3) |R| = 6.403343768760170e-02
                4) |X| = 2.380471873599142e+01
                5) |x| = 6.932703848368443e+01
                6) post smooth |x| = 4.502536862656444e+01
                4) |X| = 4.502536862656444e+01
                5) |x| = 7.998534854728734e+01
                6) post smooth |x| = 7.660075651381680e+01
  Linear solve did not converge due to DIVERGED_ITS iterations 1
Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0
06:50  /gpfs/alpine/geo127/scratch/adams$
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 4PGUs.png
Type: image/png
Size: 174720 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20190927/7b2fa8ac/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 3GPUs.png
Type: image/png
Size: 176390 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20190927/7b2fa8ac/attachment-0003.png>

