[petsc-dev] Using multiple mallocs with PETSc

Barry Smith bsmith at mcs.anl.gov
Fri Mar 10 17:06:53 CST 2017


> On Mar 10, 2017, at 1:52 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> 
> 
>   Now I have read through all the old email from Jed to understand why he hates memkind so much.
> 

   I have read through all the emails and admit I still don't understand anything. I have pasted all the juicier bits at the bottom.  The final exchange between Jeff and Jed was:

--------------------------------------------------------
Jeff - If the pattern is so trivial, then PETSc should be able to observe it and
memcpy pages between MCDRAM and DDR4.

Jed -
The difference is that memcpy changes the virtual address, which would
require non-local rewiring (in some cases).

Jeff - Your argument all along is that it is just too hard for PETSc to do
anything intelligent with user data, and yet you think Linux somehow does
better using only the VM context.

Jed - My argument was always that memory placement is not a *local* decision.
The memkind interface is static, so you either make a static local
decision or build some dynamicism around it.  But even if you build the
dynamicism, it's still a mess (at best) to collect the *non-local*
information needed to make an accurate decision.  Moreover, even with a
global view up to the present, but lacking clairvoyance (ability to
prove as-yet-unknown theorems), you cannot determine what memory is
"hottest".  Of course it's trivial in retrospect if you can profile _the
specific user configuration_.

-----

Jeff - Show me a profile-guided Linux page-migration implementation.

Jed - 
Automatic NUMA balancing has existed in Linux since 3.8, though the
algorithms have been improved over the years.  This figure shows it
working as well as manual tuning in the non-oversubscribed cases.

http://rhelblog.redhat.com/2015/01/12/mysteries-of-numa-memory-management-revealed/

My understanding of the current logic is that it assumes that moving
memory from one NUMA node to another involves moving it further away
from some cores.  So the algorithm may need tuning for KNL, but
non-oversubscribed long-running scientific workloads are a pretty easy
case.

-------------------------------------------------------------------

My interpretation of this (and the other emails) is that Jed thinks memkind serves no purpose as an API for PETSc or other applications/libraries, because at the time you malloc any particular thing you do not have enough global information to tell you what type of memory to put it in. Instead one should use a "page migration" system that moves pages between memory systems, based on profile information, as the beast runs.

Jed, is this at all accurate? If not, could you please phrase what you believe in a couple of sentences?

But Jed softens a little bit with 

--------
Richard -- I really like Barry's proposal to add this context.  I can think of other
things that could go into that context, too, like hints about how the
memory will be used (passed to the OS via madvise(2), for instance).  

Jed - I like this better.  And "memkind" should really be an enhancement to
posix_madvise.

--------

This, to me, indicates that Jed believes a person allocating something does, at least sometimes, have an idea of how the memory will be used, and thus there should be a way to provide that information. But he hates the idea of the person DECIDING what memory to use; he only wants them to be able to provide advice.
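
For concreteness, here is roughly what "advice, not decision" already looks like with plain POSIX; the allocation size and the particular hints below are just for illustration, and the kernel is free to ignore every one of them:

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
  size_t len = (size_t)1 << 28;                      /* a 256 MB work array */
  void  *work;
  /* page-align the allocation so the advice covers exactly our pages */
  if (posix_memalign(&work, (size_t)sysconf(_SC_PAGESIZE), len)) return 1;
  /* advise, don't decide: these are hints the kernel may honor or ignore */
  posix_madvise(work, len, POSIX_MADV_SEQUENTIAL);
  posix_madvise(work, len, POSIX_MADV_WILLNEED);
  /* ... fill and use work ... */
  free(work);
  return 0;
}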

Regarding the "page migration" Jed writes

--------

Jed -- Or ignore all this nonsense [memkind], implement move_pages(), and we'll have PETSc
track accesses so we can balance the pages once the app gets going.

--------

Jed, I have trouble understanding how this would be much different in performance from just using HBW memory as cache (i.e., the Intel cache mode).

-------------------------------------------------------------


Based on the previous emails I would guess Jed hates my proposal that started this email thread. But if I change it to 

PetscMallocPushAdvise()

he might be open to it?
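
To be concrete, I am imagining something like the sketch below. None of this exists today; the names PetscMallocPushAdvise()/PetscMallocPopAdvise() and the advice values are hypothetical, and the point is only that the hints form a stack that the malloc implementation may consult and is always free to ignore:

/* hypothetical advice values -- hints, never binding placement decisions */
typedef enum {PETSC_ADVISE_NONE, PETSC_ADVISE_BANDWIDTH, PETSC_ADVISE_LATENCY, PETSC_ADVISE_RARELY_USED} PetscMallocAdvise;

static PetscMallocAdvise adviseStack[64];   /* zero-initialized: slot 0 is PETSC_ADVISE_NONE */
static int               adviseTop = 0;

int PetscMallocPushAdvise(PetscMallocAdvise advice)
{
  if (adviseTop + 1 >= 64) return 1;        /* overflow; real PETSc error handling omitted */
  adviseStack[++adviseTop] = advice;
  return 0;
}

int PetscMallocPopAdvise(void)
{
  if (adviseTop > 0) --adviseTop;
  return 0;
}

/* PetscMallocAlign() would read adviseStack[adviseTop] and may ignore it */

A vector creation routine would bracket its array allocation with PetscMallocPushAdvise(PETSC_ADVISE_BANDWIDTH)/PetscMallocPopAdvise(), and a build with no fast memory simply ignores the hint.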

Note that if the array used for vectors (and for matrices also, though that is more complicated) also carried a usage count in its malloc header, we could do things like

VecGetArray(Vec x, PetscScalar **a)
{
  *a = *((PetscScalar**)x->data);
  PetscMallocTrackUsage(*a);
}

and PetscMallocTrackUsage(void *ptr) could increment the usage counter by one; if the array is in slow memory, there is room in the faster memory, and the count is high enough relative to other counts (perhaps weighted by length), it could move the array over to the fast memory, getting Jed's migration. Of course kicking things out of fast memory due to lack of use would be more difficult, but not impossible. Likely one would track only a relatively small number of mallocs(); most small ones don't need to be tracked.
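
A minimal sketch of what I mean is below; everything in it is made up except move_pages(2), which is the real Linux call Jed wants us to sit on top of. The header layout, the threshold, and the fast-node number are all placeholders:

#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <numaif.h>              /* move_pages(2); link with -lnuma */

/* hypothetical header our malloc would prepend to tracked arrays */
typedef struct {
  size_t        len;             /* bytes in the user region */
  unsigned long usage;           /* bumped on every VecGetArray() */
} TrackedHeader;

#define USAGE_THRESHOLD 100      /* crude tuning parameter */
#define FAST_NODE       1        /* NUMA node of the fast memory; platform specific */

void PetscMallocTrackUsage(void *ptr)
{
  TrackedHeader *h = (TrackedHeader *)ptr - 1;
  if (++h->usage != USAGE_THRESHOLD) return;   /* migrate once, when "hot enough" */

  long      pagesize = sysconf(_SC_PAGESIZE);
  uintptr_t start    = (uintptr_t)ptr & ~(uintptr_t)(pagesize - 1);
  size_t    npages   = ((uintptr_t)ptr + h->len - start + pagesize - 1) / pagesize;
  void    **pages    = malloc(npages * sizeof(*pages));
  int      *nodes    = malloc(npages * sizeof(*nodes));
  int      *status   = malloc(npages * sizeof(*status));
  if (pages && nodes && status) {
    for (size_t i = 0; i < npages; i++) {
      pages[i] = (void *)(start + i * (uintptr_t)pagesize);
      nodes[i] = FAST_NODE;
    }
    /* the kernel moves the pages; the virtual addresses do not change,
       so no pointer in the application needs rewiring */
    move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
  }
  free(pages); free(nodes); free(status);
}

A real version would also decay the counters and check how much fast memory remains before moving anything, which is the eviction problem mentioned above.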


  Comments? Jed has kept a low profile.

Barry



All the juicy bits from the emails that I found:
------------------------------------------------------------




Jed - 1. Has there been any progress on improving the memkind interface so
  that allocation can be made based on local information?  The present
  interface requires global information to decide where to allocate the
  memory; that is a horrid abstraction that will seriously disrupt
  software modularity and workflow, and leave a lot of applications
  with terrible utilization of MCDRAM.

------

Barry -    MPI_Comm argument?  PETSc users rarely need to call PetscMalloc()
   themselves and if they do call it then they should know the
   properties of the memory they are allocating. Most users won't
   even notice the change.

Jed -- I think that's an exaggeration, but what are you going to use for the
"kind" parameter?  The "correct" value depends on a ton of non-local
information.

Barry -  Note that I'd like to add this argument independent of memkind.

Jed -- What are you going to use it for?  If the allocation is small enough,
it'll probably be resident in cache and if it falls out, the lower
latency to DRAM will be better than HBM.  As it gets bigger, provided it
gets enough use, then HBM becomes the right place, but later it's too
big and you have to go back to DRAM.  What happens if memory of the kind
requested is unavailable?  Does it error, or does the implementation try
to find a different kind?  If there are
several memory kinds, what order is used when checking?

------

Richard -- I think many users are going to want more control than what something like
AutoHBW provides, but, as you say, a lot of the time one will only care
about the substantial allocations for things like matrices and vectors,
and these also tend to be long lived--plenty of codes will do something
like allocate a matrix for Jacobians once and keep it around for the
lifetime of the run.  Maybe we should consider not using a heap manager for
these allocations, then.  For allocations above some specified threshold,
perhaps we (PETSc) should simply do the appropriate mmap() and mbind()
calls to allocate the pages we need in the desired type of memory, and then
we could use things like move_pages() if/when appropriate (yes, I know
we don't yet have a good way to make such decisions).  This would mean
PETSc getting more into the lower level details of memory management, but
maybe this is appropriate (and unavoidable) as more kinds of
user-addressable memory get introduced.  I think this is actually less horrible
than it sounds, because, really, we would just want to do this for the
largest allocations.  (And this is somewhat analogous to how many malloc()
implementations work, anyway: use sbrk() for the small stuff, and mmap()
for the big stuff.)

Jed -- 
I say just use malloc (or posix_memalign) for everything.  PETSc can't
do a better job of the fancy stuff and these normal functions are
perfectly sufficient.
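
For reference, the low-level mmap()/mbind() sequence Richard describes would look roughly like the sketch below; both are real Linux calls, but the node number of the fast memory is platform specific, and MPOL_PREFERRED versus MPOL_BIND is exactly the "what happens when it is full" question:

#include <stddef.h>
#include <sys/mman.h>
#include <numaif.h>              /* mbind(2); link with -lnuma */

/* reserve address space, then bind the not-yet-touched pages to a node */
void *alloc_on_node(size_t len, int node)
{
  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return NULL;
  unsigned long nodemask = 1UL << node;              /* assumes node < 64 */
  /* MPOL_PREFERRED falls back to other nodes when this one is full;
     MPOL_BIND would fail or OOM instead */
  if (mbind(p, len, MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8, 0)) {
    munmap(p, len);
    return NULL;
  }
  return p;                      /* release with munmap(p, len) */
}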

--------

Richard -- hbw_preferred_str = (char *)memkind_malloc(MEMKIND_HBW_PREFERRED, size);

Jed -- How much would you prefer it?  If we stupidly ask for HBM in VecCreate_*
and MatCreate_*, then our users will see catastrophic performance drops
at magic sizes and will have support questions like "I swapped these two
independent lines and my code ran 5x faster".  Then they'll hack the
source by writing

 if (moon_is_waxing() && operator_holding_tongue_in_right_cheek()) {
   policy = MEMKIND_HBW_PREFERRED;
 }

eventually making all decisions based on nonlocal information, ignoring
the advice parameter.

Then they'll get smart and register their own malloc so they don't have
to hack the library.  Then they'll try to couple their application with
another that does the same thing and now they have to write a new malloc
that makes a new set of decisions in light of the fact that multiple
libraries are being coupled.

I think we can agree that this is madness.  Where do you draw the line
and say that crappy performance is just reality?

It's hard for me not to feel like the proposed system will be such a
nightmarish maintenance burden with such little benefit over a simple
size-based allocation that it would be better for everyone if it doesn't
exist.

For example, we've already established that small allocations should
generally go in DRAM because they're either cached or not prefetched and
thus limited by latency instead of bandwidth.  Large allocations that
get used a lot should go in HBM so long as they fit.  Since we can't
determine "used a lot" or "fit" from any information possibly available
in the calling scope, there's literally no useful advice we can provide
at that point.  So don't try, just set a dumb threshold (crude tuning
parameter) or implement a profile-guided allocation policy (brittle).

Jed -- Or ignore all this nonsense, implement move_pages(), and we'll have PETSc
track accesses so we can balance the pages once the app gets going.


------
Richard --- I'd like to be able to restrict this to only the PETSc portion: Maybe
a code that uses PETSc also needs to allocate some enormous lookup
tables that are big but have accesses that are really latency- rather
than bandwidth-sensitive.  Or, to be specific to a code I actually
know, I believe that in PFLOTRAN there are some pretty large
allocations required for auxiliary variables that don't need to go in
high-bandwidth memory, though we will want all of the large PETSc
objects to go in there.

Jed --- Fine.  That involves a couple lines of code.  Go into PetscMallocAlign
and add the ability to use memkind.  Add a run-time option to control
the threshold.  Done.

If you want complexity to bleed into the library (and necessarily into
user code if given any power at all), I think you need to demonstrate a
tangible benefit that cannot be obtained by something simpler.  Consider
the simple and dumb threshold above to be the null hypothesis.

This is just my opinion.  Feel free to make a branch with whatever you
prefer.
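
For what it's worth, Jed's "simple and dumb threshold" null hypothesis is only a few lines; memkind_malloc() and the MEMKIND_HBW_PREFERRED/MEMKIND_DEFAULT kinds are memkind's real interface, while the threshold value and the wrapper name are placeholders for what would go inside PetscMallocAlign():

#include <stddef.h>
#include <memkind.h>             /* link with -lmemkind */

static size_t hbw_threshold = (size_t)1 << 20;   /* settable by a run-time option */

/* big allocations are steered toward HBM, everything else to DRAM;
   no advice, no global knowledge, just a size check */
void *threshold_malloc(size_t size)
{
  if (size >= hbw_threshold)
    /* PREFERRED falls back to DRAM when the HBM is exhausted */
    return memkind_malloc(MEMKIND_HBW_PREFERRED, size);
  return memkind_malloc(MEMKIND_DEFAULT, size);
}

/* the matching free is memkind_free(NULL, ptr), which detects the kind */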

--------
Barry -- Perhaps, and this is just nonsense off the top of my head, if you
 had some measure of the importance of a vector (or matrix; I would
 start with vectors for simplicity and since we have more of them)
 based on how often its values would be "accessed". So a vector that
 you know is only used "once in a while" gets a lower "importance"
 than one that gets used "very often". Of course determining these
 vectors' importances may be difficult. You could do it
 experimentally: add some code that measures how often each vector
 gets its values "accessed (whatever that means)/read write" and see
 if there is some distribution (do this for a nontrivial TS example)
 where some vectors are accessed often and others rarely.

Jed ---
This is what I termed profile-guided and it's very accurate (you have
global space-time information), but super brittle when
resource-constrained.

Note that in case of Krylov solvers, the first vectors in the Krylov
space are accessed far more than later vectors (e.g., the 30th vector is
accessed once per 30 iterations versus the first vector which is
accessed every iteration).  Simple greedy allocation is great for this
case.

It's terrible in other cases, a simple case of which is two solvers
where the first is cheap (or solved only rarely) and the second is
solved repeatedly at great expense.  Nested solvers are one such
example.  But you don't know which one is more expensive except in
retrospect, and this can even change as nonlinearities evolve.

--------

Jeff -
The beauty of git/github is one can make branches to try out anything
they want even if Jed thinks that he knows better than Intel how to
write system software for Intel's hardware.

Jed ---
I'm objecting to the interface.  I think that if they try to get memkind
merged into the existing libnuma project, they'll see similar
resistance.  It is essential for low-level interfaces to create
foundations that can be reliably built upon, not gushing wounds that
bleed complexity into everything built on top.


1. I cannot test it because I don't have access to the hardware.

2. I think memkind is solving the wrong problem in the wrong way.

3. According to Richard, the mature move_pages(2) interface has been
implemented.  That's what I wanted, so I'll just use that -- memkind
dependency gone.

---------

Jeff --- The memkind library itself was developed entirely without access to
the hardware to which you refer, so this complaint is not relevant.

Jed ---- The interesting case here is testing failure modes in the face of
resource exhaustion, which doesn't seem to have been addressed in a
serious way by memkind and requires other trickery to test without
MCDRAM.  Also, the performance effects are relevant.  But I don't want
anything else in memkind because I don't want to use memkind for
anything ever.

Jed - 2. I think memkind is solving the wrong problem in the wrong way.

Jeff - It is more correct to say it is solving a different problem than the
one you care about.  memkind is the correct way to solve the problem
it is trying to solve.  Please stop equating your disagreement with
the problem statement as evidence that the solution is terrible.

Jed - This is pedantry.  Is there a clear statement of what problem memkind
solves?

Jeff -  The memkind library is a user extensible heap manager built on top of
 jemalloc which enables control of memory characteristics and a
 partitioning of the heap between kinds of memory.

Jed - This is just a low-level statement about what it does and I would argue
it doesn't even do this in a useful way because it is entirely at
allocation time assuming the caller is omniscient.

Jed - 3. According to Richard, the mature move_pages(2) interface has been
implemented.  That's what I wanted, so I'll just use that -- memkind
dependency gone.

Jeff - Does this mean that you will stop complaining about memkind, since it
is not directly relevant to your life?  I would like that.

Jed -  Yes, as soon as people stop telling me that I should use memkind and
stop asking to put it into packages I interact with, I'll ignore it like
countless other projects that are irrelevant to what I do.  But if, like
OpenMP, the turd keeps landing in my breakfast, I'm probably going to
mention that it's impolite to keep pooping in my breakfast.

--------

Barry - It is OUR job as PETSc developers to hide that complexity from the
 "most people" who would be driven away from HPC because of it. 

Jed - Absolutely.  So now the question becomes "what benefit can this have,
predicated on not letting the complexity bleed onto the user?"

Barry -  Thus if Richard proposed changing VecCreate() to VecCreate(MPI_Comm,
 Crazy Intel specific Memkind options, Vec *x); we would reject
 it. He is not even coming close to proposing that, in fact he is not
proposing anything, he is just asking for advice on how to run some
 experiments to see if the Phi crazy memory shit can be beneficial to
 some PETSc apps.

Jed - And my advice is to start with the simplest thing possible.

I'm also expressing skepticism that a more sophisticated solution that
_does not bleed complexity on the user_ is capable of substantially
beating the simple thing across a meaningful range of applications.







