[petsc-users] Configuring PETSc for KNL

Matthew Knepley knepley at gmail.com
Wed Apr 5 12:00:07 CDT 2017


On Wed, Apr 5, 2017 at 11:54 AM, Zhang, Hong <hongzhang at anl.gov> wrote:

>
> > On Apr 5, 2017, at 10:53 AM, Jed Brown <jed at jedbrown.org> wrote:
> >
> > "Zhang, Hong" <hongzhang at anl.gov> writes:
> >
> >> On Apr 4, 2017, at 10:45 PM, Justin Chang <jychang48 at gmail.com> wrote:
> >>
> >> So I tried the following options:
> >>
> >> -M 40
> >> -N 40
> >> -P 5
> >> -da_refine 1/2/3/4
> >> -log_view
> >> -mg_coarse_pc_type gamg
> >> -mg_levels_0_pc_type gamg
> >> -mg_levels_1_sub_pc_type cholesky
> >> -pc_type mg
> >> -thi_mat_type baij
> >>
> >> Performance improved dramatically. However, Haswell still beats out KNL
> >> but only by a little. Now it seems like MatSOR is taking some time (though
> >> I can't really judge whether it's significant or not). Attached are the log
> >> files.
> >>
> >>
> >> MatSOR takes only 3% of the total time. Most of the time is spent on
> >> PCSetUp (~30%) and PCApply (~11%).
> >
> > I don't see any of your conclusions in the actual data, unless you only
> > looked at the smallest size that Justin tested.  For example, from the
> > largest problem size in Justin's logs:
>
> My mistake. I did not see the results for the large problem sizes. I was
> talking about the data for the smallest case.
>
> Now I am very surprised by the performance of MatSOR:
>
> -da_refine 1 ~2x slower on KNL
> -da_refine 2 ~2x faster on KNL
> -da_refine 3 ~2x faster on KNL
> -da_refine 4 almost the same
>
> KNL
>
> -da_refine 1 MatSOR              1185 1.0 2.8965e-01 1.1 7.01e+07 1.0 0.0e+00 0.0e+00 0.0e+00  3 41  0  0  0   3 41  0  0  0 15231
> -da_refine 2 MatSOR              1556 1.0 1.6883e+00 1.0 5.82e+08 1.0 0.0e+00 0.0e+00 0.0e+00 11 44  0  0  0  11 44  0  0  0 22019
> -da_refine 3 MatSOR              2240 1.0 1.4959e+01 1.0 5.51e+09 1.0 0.0e+00 0.0e+00 0.0e+00 22 45  0  0  0  22 45  0  0  0 23571
> -da_refine 4 MatSOR              2688 1.0 2.3942e+02 1.1 4.47e+10 1.0 0.0e+00 0.0e+00 0.0e+00 36 45  0  0  0  36 45  0  0  0 11946
>
>
> Haswell
> -da_refine 1 MatSOR              1167 1.0 1.4839e-01 1.1 1.42e+08 1.0 0.0e+00 0.0e+00 0.0e+00  3 42  0  0  0   3 42  0  0  0 30450
> -da_refine 2 MatSOR              1532 1.0 2.9772e+00 1.0 1.17e+09 1.0 0.0e+00 0.0e+00 0.0e+00 28 44  0  0  0  28 44  0  0  0 12539
> -da_refine 3 MatSOR              1915 1.0 2.7142e+01 1.1 9.51e+09 1.0 0.0e+00 0.0e+00 0.0e+00 45 45  0  0  0  45 45  0  0  0 11216
> -da_refine 4 MatSOR              2262 1.0 2.2116e+02 1.1 7.56e+10 1.0 0.0e+00 0.0e+00 0.0e+00 48 45  0  0  0  48 45  0  0  0 10936
>
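
For anyone wanting to reproduce these runs, the set of options Justin listed
above amounts to an invocation along these lines (the launcher, rank count,
and binary path are placeholders on my part, not taken from the logs; ex48 is
the SNES THI tutorial, and -da_refine 4 is just one of the four sizes swept):

  mpiexec -n <np> ./ex48 -M 40 -N 40 -P 5 -da_refine 4 \
      -thi_mat_type baij -pc_type mg \
      -mg_coarse_pc_type gamg -mg_levels_0_pc_type gamg \
      -mg_levels_1_sub_pc_type cholesky -log_view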

SOR should track memory bandwidth, so it seems to me either

  a) We fell out of MCDRAM

or

  b) We saturated the KNL node, but not the Haswell configuration

I think these are all runs with identical parallelism, so it's not b).
Justin, did you tell it to fall back to DRAM, or fail?
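
In flat mode that choice is made at launch time: MCDRAM typically shows up as
NUMA node 1 on KNL, so the two behaviors look roughly like this (rank count,
binary, and options are placeholders):

  # Fail with an allocation error once the 16 GB of MCDRAM is exhausted:
  mpiexec -n <np> numactl --membind=1 ./ex48 <options>

  # Or silently fall back to DDR4 when MCDRAM fills up:
  mpiexec -n <np> numactl --preferred=1 ./ex48 <options>

The first variant makes it easy to tell whether the -da_refine 4 working set
still fits in MCDRAM at all.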

  Thanks,

    Matt



> Hong (Mr.)
>
>
> > KNL:
> > MatSOR              2688 1.0 2.3942e+02 1.1 4.47e+10 1.0 0.0e+00 0.0e+00 0.0e+00 36 45  0  0  0  36 45  0  0  0 11946
> > KSPSolve               8 1.0 4.3837e+02 1.0 9.87e+10 1.0 1.5e+06 8.8e+03 5.0e+03 68 99 98 61 98  68 99 98 61 98 14409
> > SNESSolve              1 1.0 6.1583e+02 1.0 9.95e+10 1.0 1.6e+06 1.4e+04 5.1e+03 96100100100 99  96100100100 99 10338
> > SNESFunctionEval       9 1.0 3.8730e+01 1.0 0.00e+00 0.0 9.2e+03 3.2e+04 0.0e+00  6  0  1  1  0   6  0  1  1  0     0
> > SNESJacobianEval      40 1.0 1.5628e+02 1.0 0.00e+00 0.0 4.4e+04 2.5e+05 1.4e+02 24  0  3 49  3  24  0  3 49  3     0
> > PCSetUp               16 1.0 3.4525e+01 1.0 6.52e+07 1.0 2.8e+05 1.0e+04 3.8e+03  5  0 18 13 74   5  0 18 13 74   119
> > PCSetUpOnBlocks       60 1.0 9.5716e-01 1.1 1.41e+05 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > PCApply               60 1.0 3.8705e+02 1.0 9.32e+10 1.0 1.2e+06 8.0e+03 1.1e+03 60 94 79 45 21  60 94 79 45 21 15407
> > MatMult             2860 1.0 1.4578e+02 1.1 4.92e+10 1.0 1.2e+06 8.8e+03 0.0e+00 21 49 77 48  0  21 49 77 48  0 21579
> >
> > Haswell:
> > MatSOR              2262 1.0 2.2116e+02 1.1 7.56e+10 1.0 0.0e+00 0.0e+00 0.0e+00 48 45  0  0  0  48 45  0  0  0 10936
> > KSPSolve               7 1.0 3.5937e+02 1.0 1.67e+11 1.0 6.7e+05 1.3e+04 4.5e+03 81 99 98 60 98  81 99 98 60 98 14828
> > SNESSolve              1 1.0 4.3749e+02 1.0 1.68e+11 1.0 6.8e+05 2.1e+04 4.5e+03 99100100100 99  99100100100 99 12280
> > SNESFunctionEval       8 1.0 1.5460e+01 1.0 0.00e+00 0.0 4.1e+03 4.7e+04 0.0e+00  3  0  1  1  0   3  0  1  1  0     0
> > SNESJacobianEval      35 1.0 6.8994e+01 1.0 0.00e+00 0.0 1.9e+04 3.8e+05 1.3e+02 16  0  3 50  3  16  0  3 50  3     0
> > PCSetUp               14 1.0 1.0860e+01 1.0 1.15e+08 1.0 1.3e+05 1.4e+04 3.4e+03  2  0 19 13 74   2  0 19 13 74   335
> > PCSetUpOnBlocks       50 1.0 4.5601e-02 1.6 2.89e+05 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     6
> > PCApply               50 1.0 3.3545e+02 1.0 1.57e+11 1.0 5.3e+05 1.2e+04 9.7e+02 75 94 77 44 21  75 94 77 44 21 15017
> > MatMult             2410 1.0 1.2050e+02 1.1 8.28e+10 1.0 5.1e+05 1.3e+04 0.0e+00 27 49 75 46  0  27 49 75 46  0 21983
> >
> >> If ex48 has SSE2 intrinsics, does that mean Haswell would almost always
> >> be better?
> >>
> >> The Jacobian evaluation (which has SSE2 intrinsics) on Haswell is about
> >> two times as fast as on KNL, but it eats only 3%-4% of the total time.
> >
> > SNESJacobianEval alone accounts for 90 seconds of the 180 second
> > difference between KNL and Haswell.
> >
> >> According to your logs, the compute-intensive kernels such as MatMult,
> >> MatSOR, PCApply run faster (~2X) on Haswell.
> >
> > They run almost the same speed.
> >
> >> But since the setup time dominates in this test,
> >
> > It doesn't dominate on the larger sizes.
> >
> >> Haswell would not show much benefit. If you increase the problem size,
> >> it could be expected that the performance gap would also increase.
> >
> > Backwards.  Haswell is great for low latency on small problem sizes
> > while KNL offers higher theoretical throughput (often not realized due
> > to lack of vectorization) for sufficiently large problem sizes
> > (especially if they don't fit in Haswell L3 cache but do fit in MCDRAM).
>
>
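
One more configuration note on the vectorization point above: if the KNL build
was not compiled for AVX-512, much of KNL's theoretical throughput is simply
unavailable. A configure sketch with the Intel compilers (wrappers and flags
are illustrative, not Justin's actual build) would look something like:

  ./configure --with-cc=mpiicc --with-cxx=mpiicpc --with-fc=mpiifort \
      --with-debugging=0 --with-memalign=64 \
      COPTFLAGS='-g -O3 -xMIC-AVX512' \
      CXXOPTFLAGS='-g -O3 -xMIC-AVX512' \
      FOPTFLAGS='-g -O3 -xMIC-AVX512'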


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener