[petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

Jeff Hammond jeff.science at gmail.com
Fri Jul 8 11:40:54 CDT 2016

> > 1) How do we run at bandwidth peak on new architectures like Cori or
> Aurora?
>   Huh, there is a how here, not a why?
> >
> > Patrick and Rich have good suggestions here. Karl and Rich showed some
> promising numbers for KNL at the PETSc meeting.
> >
> >
> > Future systems from multiple vendors basically move from 2-tier memory
> hierarchy of shared LLC and DRAM to a 3-tier hierarchy of fast memory (e.g.
> HBM), regular memory (e.g. DRAM), and slow (likely nonvolatile) memory  on
> a node.
>   Jeff,
>    Would Intel sell me a system that had essentially no regular memory
> DRAM (which is too slow anyway) and no slow memory (which is absurdly too
> slow)?  What cost savings would I get in $ and power usage compared to say
> what is going in the theta? 10% and 20%, 5% and 30%, 5% and 5 %? If it is a
> significant savings then get the cut down machine, if it is insignificant
> then realize the cost of not using it (the DRAM you paid so little for) is
> insignificant and not worth worrying about, just like cruise control when
> you don't use the highway. Actually I could use the DRAM to store the
> history needed for the adjoints; so maybe it is ok to keep, but surely not
> useful for data that is continuously involved in the computation.

*Disclaimer: All of the following data is pulled off of the Internet, which
in some cases is horribly unreliable.  My comments are strictly for
academic discussion and not meant to be authoritative or have any influence
on purchasing or design decisions.  Do not equate quoted TDP to measured
power during any workload, or assume that different measurements can be
compared directly.*

Your thinking is in line with

Intel sells KNL packages as parts (
that don't have any DRAM in them, just MCDRAM.  It's the decision of the
integrator what goes into the system, which of course is correlated to what
the intended customer wants.  While you might not need a node with DRAM,
many users do, and the systems that DOE buys are designed to meet the needs
of their broad user base.

I don't know if KNL is bootable with no DRAM at all - this is likely
more to do with what motherboard, BIOS, etc. expect than the processor
package itself.  However, the KNL all-to-all mode addresses the case where
DRAM channels are underpopulated (with fully populated channels, one should
use quadrant, hemisphere, SNC-2 or SNC-4), so if DRAM is necessary, you
should be able to boot it with only one channel populated.  Of course, if
you do this, you'll get 1/6 of the DDR4 bandwidth.

As to the question of DRAM power, there is a lot of detailed information
available (e.g.
https://lenovopress.com/lp0083.pdf) but since I am lazy, I'll use the
numbers reported on
for client memory (i.e. not server memory, hence probably not providing
ECC, but ECC doesn't change power consumption much), which works out to
0.37 W/GB for DDR4-2133, hence 71 W for 192 GB [
That 71 W is ~1/3 of the processor package power (215 W).  The network
adapter draws some power, and the cables and switches (especially optics)
are a nontrivial power draw.  So DRAM is at most 25% of the node power, and
perhaps ~17% of system power based upon what I can derive from Shaheen II.
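
For anyone who wants to redo the node-level arithmetic above with other
capacities or W/GB figures, here is a minimal Python sketch.  The constants
are the ones quoted in this message; the function and variable names are
made up for illustration, and the "share" counts only package + DRAM
(network, cooling, etc. are ignored, which is why it is an upper bound).

```python
# Rough node-level DRAM power estimate from the figures quoted in this message.
# These are Internet-derived estimates (TDP-style numbers), not measurements.
DDR4_W_PER_GB = 0.37   # client DDR4-2133 estimate quoted above
DRAM_GB = 192          # DRAM capacity used in the example above
KNL_PACKAGE_W = 215    # processor package power quoted above


def dram_power_w(capacity_gb, w_per_gb=DDR4_W_PER_GB):
    """Estimated DRAM power in watts (hypothetical helper, linear in capacity)."""
    return capacity_gb * w_per_gb


dram_w = dram_power_w(DRAM_GB)
# DRAM power relative to the processor package alone.
ratio_to_package = dram_w / KNL_PACKAGE_W
# DRAM share if the node consisted only of package + DRAM (upper bound).
share_of_package_plus_dram = dram_w / (dram_w + KNL_PACKAGE_W)

print(f"DRAM power:             {dram_w:.0f} W")
print(f"DRAM / package:         {ratio_to_package:.2f}")
print(f"DRAM / (package+DRAM):  {share_of_package_plus_dram:.1%}")
```

Anything else in the node that draws power only pushes the DRAM share
further below that last number.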

Shaheen II Cray XC40
1.96 MW = 6174 * (2 sockets * 135 W/socket + 128 GB * 0.37 W/GB)
2.83 MW total
= 69% from CPU+DRAM
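
The same estimate as a small Python sketch, in case someone wants to plug in
a different machine.  All inputs are the numbers listed above; the variable
names are illustrative.  The last line is only one possible reading of the
~17% figure (the ~25% node-level DRAM share scaled by the CPU+DRAM fraction
of system power), which is an assumption on the part of this sketch.

```python
# Shaheen II (Cray XC40) back-of-the-envelope estimate from the numbers above.
# Not exact: socket power is a TDP-style figure and W/GB is a client-memory estimate.
NODES = 6174
SOCKETS_PER_NODE = 2
W_PER_SOCKET = 135        # W per socket, as quoted above
DRAM_GB_PER_NODE = 128
DRAM_W_PER_GB = 0.37      # same client DDR4 estimate as before
TOTAL_SYSTEM_MW = 2.83    # total system power, as quoted above

# Per-node CPU + DRAM power in watts.
node_w = SOCKETS_PER_NODE * W_PER_SOCKET + DRAM_GB_PER_NODE * DRAM_W_PER_GB
# Whole-system CPU + DRAM power in megawatts.
cpu_dram_mw = NODES * node_w / 1e6
# Fraction of total system power attributed to CPU + DRAM.
cpu_dram_fraction = cpu_dram_mw / TOTAL_SYSTEM_MW

# Assumed reading of the ~17% figure: ~25% node-level DRAM share
# scaled by the CPU+DRAM fraction of system power.
NODE_LEVEL_DRAM_SHARE = 0.25
dram_system_share = NODE_LEVEL_DRAM_SHARE * cpu_dram_fraction

print(f"CPU+DRAM power:          {cpu_dram_mw:.2f} MW")
print(f"CPU+DRAM / total:        {cpu_dram_fraction:.0%}")
print(f"DRAM / total (assumed):  {dram_system_share:.0%}")
```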

Again, *these are not the exact numbers* but what I can derive from
https://www.hpc.kaust.edu.sa/content/shaheen-ii and

Back to the higher level analysis, what is unfortunate about DRAM is that
it needs power to hold data even if the data isn't used, because it is not
persistent.  I don't know how well it powers down when the physical memory
isn't mapped but it seems that power is not gated today [
http://digitalpiglet.org/research/sion2014socc.pdf].  The advantage of
nonvolatile memory is that it doesn't require power to preserve data that
is not being accessed.

I suspect that nonvolatile memory (NVM) is the right place to put your
adjoint matrices, provided the NVM bandwidth is sufficient.

*Disclaimer: All of these are academic comments.  Do not use them to try to
influence others or make any decisions.  Do your own research and be
skeptical of everything I derived from the Internet.*


