[petsc-dev] Using multiple mallocs with PETSc

Barry Smith bsmith at mcs.anl.gov
Sat Mar 11 14:25:33 CST 2017


> On Mar 11, 2017, at 1:11 PM, Jed Brown <jed at jedbrown.org> wrote:
> 
> Barry Smith <bsmith at mcs.anl.gov> writes:
> 
>>> On Mar 10, 2017, at 1:52 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>> 
>>> 
>>>  Now I have read through all the old email from Jed to understand why he hates memkind so much.
>>> 
>> 
>>   I have read through all the emails and admit I still don't understand anything. I have pasted all the juicier bits at the bottom.  The final exchange between Jeff and Jed was 
>> 
>> --------------------------------------------------------
>> Jeff - If the pattern is so trivial, then PETSc should be able to observe it and
>> memcpy pages between MCDRAM and DDR4.
>> 
>> Jed -
>> The difference is that memcpy changes the virtual address, which would
>> require non-local rewiring (in some cases).
>> 
>> Jeff - Your argument all along is that it is just too hard for PETSc to do
>> anything intelligent with user data, and yet you think Linux somehow does
>> better using only the VM context.
>> 
>> Jed - My argument was always that memory placement is not a *local* decision.
>> The memkind interface is static, so you either make a static local
>> decision or build some dynamicism around it.  But even if you build the
>> dynamicism, it's still a mess (at best) to collect the *non-local*
>> information needed to make an accurate decision.  Moreover, even with a
>> global view up to the present, but lacking clairvoyance (ability to
>> prove as-yet-unknown theorems), you cannot determine what memory is
>> "hottest".  Of course it's trivial in retrospect if you can profile _the
>> specific user configuration_.
>> 
>> -----
>> 
>> Jeff - Show me a profile-guided Linux page-migration implementation.
>> 
>> Jed - 
>> Automatic NUMA balancing has existed in Linux since 3.8, though the
>> algorithms have been improved over the years.  This figure shows it
>> working as well as manual tuning in the non-oversubscribed cases.
>> 
>> http://rhelblog.redhat.com/2015/01/12/mysteries-of-numa-memory-management-revealed/
>> 
>> My understanding of the current logic is that it assumes that moving
>> memory from one NUMA node to another involves moving it further away
>> from some cores.  So the algorithm may need tuning for KNL, but
>> non-oversubscribed long-running scientific workloads are a pretty easy
>> case.
>> 
>> -------------------------------------------------------------------
>> 
>> My interpretation of this (and the other emails) is that Jed thinks memkind serves no purpose as an API to be used by PETSc or other applications/libraries, because at the moment you malloc any particular thing you do not have enough global information to tell you what type of memory to put it in. Instead one should use a "page migration" system that moves pages between memory systems based on profile information gathered as the beast runs.
>> 
>> Jed, is this at all accurate? If not, could you please phrase what you believe in a couple of sentences?
> 
> I think it's accurate in the sense that the performance of real
> applications using a page migration system will be sufficiently close to
> the best manual page mapping strategy that nobody should bother with the
> manual system.

   Will such a page migration system ever exist? Is Intel working hard on one for KNL? What if no one provides such a page migration system? Should we just wait around until they do (which they won't) and do nothing else in the meantime? Or will we have to do a half-assed hacky thing to work around the lack of the mythical decent page migration system?

> 
>> But Jed softens a little bit with 
>> 
>> --------
>> Richard -- I really like Barry's proposal to add this context.  I can think of other
>> things that could go into that context, too, like hints about how the
>> memory will be used (passed to the OS via madvise(2), for instance).  
>> 
>> Jed - I like this better.  And "memkind" should really be an enhancement to
>> posix_madvise.
> 
> madvise says how the memory will be used, not where to place it, and it
> can be called by a different module than allocates the memory.
> 
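
For readers unfamiliar with the interface being discussed, here is a minimal
sketch of an advisory allocation using the standard POSIX calls (the wrapper
name is made up for illustration). The key property is that the caller only
describes expected access; placement remains the kernel's decision:

    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* Allocate page-aligned memory and attach a usage hint.  The hint is
       advice only; the kernel may act on it or ignore it entirely. */
    static void *alloc_streaming(size_t bytes)
    {
      void *p = NULL;
      if (posix_memalign(&p, (size_t)sysconf(_SC_PAGESIZE), bytes)) return NULL;
      (void)posix_madvise(p, bytes, POSIX_MADV_SEQUENTIAL);
      return p;
    }

Note that, per Jed's point, any later user of the range can call
posix_madvise() again with different advice; the hint is not tied to the
module that allocated the memory.
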
>> --------
>> 
>> This, to me, indicates that Jed believes at least sometimes a person
>> allocating something
> 
> madvise can be called by any user of the memory, not just the code that
> allocates it.
> 
>> does have an idea of how the memory is to be used, and thus
>> there should be a way to provide that information. But he hates the
>> idea of the person DECIDING what memory to use; he only wants them to
>> be able to provide advice.
>> 
>> Regarding the "page migration" Jed writes
>> 
>> --------
>> 
>> Jed -- Or ignore all this nonsense [memkind], implement move_pages(), and we'll have PETSc
>> track accesses so we can balance the pages once the app gets going.
>> 
>> --------
>> 
>> Jed, I have trouble understanding how this would differ much in performance from just using HBW memory as a cache (i.e., the Intel cache mode)?
> 
> In cache mode, accessing infrequently-used memory (like TS trajectory)
> evicts memory that you will use again soon.

   What if you could advise the malloc system that this chunk of memory should not be cached? Though this appears to be impossible by design, since in cache mode the hardware manages MCDRAM transparently?
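
For concreteness, a hedged sketch of the move_pages(2)-based migration Jed
keeps referring to (Linux-specific; link with -lnuma; the helper name and the
choice of target node are illustrative, and choosing the target is exactly
the hard part under debate):

    #include <stdlib.h>
    #include <unistd.h>
    #include <numaif.h>

    /* Migrate an allocation's pages to one NUMA node (on KNL in flat mode,
       MCDRAM appears as its own node).  Virtual addresses are unchanged, so
       no pointers need rewiring -- the property that a memcpy between kinds
       of memory lacks. */
    static int migrate_to_node(void *buf, size_t bytes, int target_node)
    {
      size_t        psz = (size_t)sysconf(_SC_PAGESIZE);
      unsigned long n   = (bytes + psz - 1)/psz;
      void **pages  = malloc(n*sizeof(void*));
      int   *nodes  = malloc(n*sizeof(int));
      int   *status = malloc(n*sizeof(int));
      long   rc;
      for (unsigned long i = 0; i < n; i++) {
        pages[i] = (char*)buf + i*psz;   /* one entry per page */
        nodes[i] = target_node;
      }
      rc = move_pages(0, n, pages, nodes, status, MPOL_MF_MOVE);
      free(pages); free(nodes); free(status);
      return (int)rc;
    }
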
   
> 
>> -------------------------------------------------------------
>> 
>> 
>> Based on the previous emails I would guess Jed hates my proposal that started this email thread. But if I change it to 
>> 
>> PetscMallocPushAdvise()
>> 
>> he might be open to it?
> 
> That would be fine.
> 
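To make the proposal concrete, a sketch of what such an API might look like.
This is entirely hypothetical -- neither the enum nor the functions exist in
PETSc -- and the Push/Pop shape just mirrors existing PETSc conventions such
as PetscPushErrorHandler():

    /* Hypothetical: advice is scoped, applying to all PetscMallocs until
       the matching Pop; it names a usage pattern, not a memory kind. */
    typedef enum {PETSC_ADVISE_NONE, PETSC_ADVISE_STREAMING,
                  PETSC_ADVISE_LATENCY_SENSITIVE, PETSC_ADVISE_RARELY_USED} PetscMallocAdvise;

    PetscErrorCode PetscMallocPushAdvise(PetscMallocAdvise advise);
    PetscErrorCode PetscMallocPopAdvise(void);

    /* usage: hint that TS trajectory storage is rarely touched */
    ierr = PetscMallocPushAdvise(PETSC_ADVISE_RARELY_USED);CHKERRQ(ierr);
    ierr = PetscMalloc1(n,&traj);CHKERRQ(ierr);
    ierr = PetscMallocPopAdvise();CHKERRQ(ierr);
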
>> Note that if the array used for vectors (and for matrices also, though that is more complicated) also had a usage count in its malloc header, we could do things like
> 
> The State counter is basically this.
> 
>> VecGetArray(Vec x, PetscScalar **a)
>> {
>>   *a = *((PetscScalar**)x->data);
>>   PetscMallocTrackUsage(*a);
>>   ...
>> }
>> 
>> and PetscMallocTrackUsage(void *ptr) could increment the usage counter by one; if the array is in slow memory, there is room in the faster memory, and the count is high enough relative to other counts (perhaps weighted by length), it could move the array over to the fast memory, giving us Jed's migration. Of course kicking things out of fast memory due to lack of use would be more difficult, but not impossible. Likely one would track only a relatively small number of mallocs(); most small ones don't need to be tracked.
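
A hedged sketch of the tracking side (every name here is hypothetical --
threshold, FastMemHasRoom(), MigrateToFastMemory() -- and the promotion
policy is precisely the contested part of the whole thread):

    /* Hypothetical malloc header carrying a usage count and location flag. */
    typedef struct {
      size_t    len;
      PetscInt  usage;     /* bumped on each VecGetArray() */
      PetscBool in_fast;   /* currently resident in fast (MCDRAM) pages? */
    } TrackedHeader;

    PetscErrorCode PetscMallocTrackUsage(void *ptr)
    {
      TrackedHeader *h = ((TrackedHeader*)ptr) - 1;
      h->usage++;
      /* promote hot arrays when fast memory has room; doing the move with
         move_pages(2) keeps the virtual address valid for all existing
         pointers into the array */
      if (!h->in_fast && h->usage > threshold && FastMemHasRoom(h->len)) {
        MigrateToFastMemory(ptr,h->len);
        h->in_fast = PETSC_TRUE;
      }
      return 0;
    }
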
>> 
>> 
>>  Comments? Jed has kept a low profile.
> 
> Some days I do nothing but review proposals and applications and ...
> 
>> All the juicy bits from the emails that I found.
>> ------------------------------------------------------------
>> 
>> 
>> 
>> 
>> Jed - 1. Has there been any progress on improving the memkind interface so
>>  that allocation can be made based on local information?  The present
>>  interface requires global information to decide where to allocate the
>>  memory; that is a horrid abstraction that will seriously disrupt
>>  software modularity and workflow, and leave a lot of applications
>>  with terrible utilization of MCDRAM.
>> 
>> ------
>> 
>> Barry -    MPI_Comm argument?  PETSc users rarely need to call PetscMalloc()
>>   themselves and if they do call it then they should know the
>>   properties of the memory they are allocating. Most users won't
>>   even notice the change.
>> 
>> Jed -- I think that's an exaggeration, but what are you going to use for the
>> "kind" parameter?  The "correct" value depends on a ton of non-local
>> information.
>> 
>> Barry -  Note that I'd like to add this argument independent of memkind.
>> 
>> Jed -- What are you going to use it for?  If the allocation is small enough,
>> it'll probably be resident in cache and if it falls out, the lower
>> latency to DRAM will be better than HBM.  As it gets bigger, provided it
>> gets enough use, then HBM becomes the right place, but later it's too
>> big and you have to go back to DRAM.
>> 
>> What happens if memory of the kind requested is unavailable?  Error, or
>> the implementation tries to find a different kind?  If there are
>> several memory kinds, what order is used when checking?
>> 
>> --------
>> 
>> Richard -- I really like Barry's proposal to add this context.  I can think of other
>> things that could go into that context, too, like hints about how the
>> memory will be used (passed to the OS via madvise(2), for instance).  
>> 
>> Jed - I like this better.  And "memkind" should really be an enhancement to
>> posix_madvise.
>> 
>> ------
>> 
>> Richard -- I think many users are going to want more control than what something like
>> AutoHBW provides, but, as you say, a lot of the time one will only care
>> about the substantial allocations for things like matrices and vectors,
>> and these also tend to be long lived--plenty of codes will do something
>> like allocate a matrix for Jacobians once and keep it around for the
>> lifetime of the run.  Maybe we should consider not using a heap manager for
>> these allocations, then.  For allocations above some specified threshold,
>> perhaps we (PETSc) should simply do the appropriate mmap() and mbind()
>> calls to allocate the pages we need in the desired type of memory, and then
>> we could use things like use move_pages() if/when appropriate (yes, I know
>> we don't yet have a good way to make such decisions).  This would mean
>> PETSc getting more into the lower level details of memory management, but
>> maybe this is appropriate (and unavoidable) as more kinds of
>> user-addressable memory get introduced.  I think this is actually less horrible
>> than it sounds, because, really, we would just want to do this for the
>> largest allocations.  (And this is somewhat analogous to how many malloc()
>> implementations work, anyway: Use sbrk() for the small stuff, and mmap()
>> for the big stuff.)
>> 
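
A hedged sketch of the mmap()/mbind() sequence Richard describes
(Linux-specific; link with -lnuma; the node argument and the choice of
MPOL_PREFERRED as the fallback policy are assumptions):

    #include <sys/mman.h>
    #include <numaif.h>

    /* Map a large allocation and bind its pages to a chosen NUMA node
       (MCDRAM shows up as a separate node on KNL in flat mode). */
    static void *big_alloc_on_node(size_t bytes, int node)
    {
      void *p = mmap(NULL, bytes, PROT_READ|PROT_WRITE,
                     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
      unsigned long mask = 1UL << node;
      if (p == MAP_FAILED) return NULL;
      /* MPOL_PREFERRED falls back to other memory when the node is full,
         rather than failing the way MPOL_BIND would */
      (void)mbind(p, bytes, MPOL_PREFERRED, &mask, sizeof(mask)*8, 0);
      return p;
    }
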
>> Jed -- 
>> I say just use malloc (or posix_memalign) for everything.  PETSc can't
>> do a better job of the fancy stuff and these normal functions are
>> perfectly sufficient.
>> 
>> --------
>> 
>> Richard -- hbw_preferred_str = (char *)memkind_malloc(MEMKIND_HBW_PREFERRED, size);
>> 
>> How much would you prefer it?  If we stupidly ask for HBM in VecCreate_*
>> and MatCreate_*, then our users will see catastrophic performance drops
>> at magic sizes and will have support questions like "I swapped these two
>> independent lines and my code ran 5x faster".  Then they'll hack the
>> source by writing
>> 
>> if (moon_is_waxing() && operator_holding_tongue_in_right_cheek()) {
>>   policy = MEMKIND_HBW_PREFERRED;
>> }
>> 
>> eventually making all decisions based on nonlocal information, ignoring
>> the advice parameter.
>> 
>> Then they'll get smart and register their own malloc so they don't have
>> to hack the library.  Then they'll try to couple their application with
>> another that does the same thing and now they have to write a new malloc
>> that makes a new set of decisions in light of the fact that multiple
>> libraries are being coupled.
>> 
>> I think we can agree that this is madness.  Where do you draw the line
>> and say that crappy performance is just reality?
>> 
>> It's hard for me not to feel like the proposed system will be such a
>> nightmarish maintenance burden with such little benefit over a simple
>> size-based allocation that it would be better for everyone if it doesn't
>> exist.
>> 
>> For example, we've already established that small allocations should
>> generally go in DRAM because they're either cached or not prefetched and
>> thus limited by latency instead of bandwidth.  Large allocations that
>> get used a lot should go in HBM so long as they fit.  Since we can't
>> determine "used a lot" or "fit" from any information possibly available
>> in the calling scope, there's literally no useful advice we can provide
>> at that point.  So don't try, just set a dumb threshold (crude tuning
>> parameter) or implement a profile-guided allocation policy (brittle).
>> 
>> Jed -- Or ignore all this nonsense, implement move_pages(), and we'll have PETSc
>> track accesses so we can balance the pages once the app gets going.
>> 
>> 
>> ------
>> Richard --- I'd like to be able to restrict this to only the PETSc portion: Maybe
>> a code that uses PETSc also needs to allocate some enormous lookup
>> tables that are big but have accesses that are really latency- rather
>> than bandwidth-sensitive.  Or, to be specific to a code I actually
>> know, I believe that in PFLOTRAN there are some pretty large
>> allocations required for auxiliary variables that don't need to go in
>> high-bandwidth memory, though we will want all of the large PETSc
>> objects to go in there.
>> 
>> Jed --- Fine.  That involves a couple lines of code.  Go into PetscMallocAlign
>> and add the ability to use memkind.  Add a run-time option to control
>> the threshold.  Done.
>> 
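
For reference, the null hypothesis Jed describes really is only a few lines;
a sketch assuming memkind is installed (the cutoff value and the option name
are made up; frees would go through memkind_free() with the same kind):

    #include <memkind.h>   /* link with -lmemkind */

    /* "Simple and dumb threshold": large allocations prefer MCDRAM and fall
       back to DDR when it is full; small ones go straight to DDR.  The cutoff
       would come from a run-time option, e.g. -malloc_hbw_threshold. */
    static size_t hbw_threshold = 1048576;  /* 1 MiB; crude tuning parameter */

    static void *threshold_malloc(size_t bytes)
    {
      memkind_t kind = (bytes >= hbw_threshold) ? MEMKIND_HBW_PREFERRED
                                                : MEMKIND_DEFAULT;
      return memkind_malloc(kind, bytes);
    }
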
>> If you want complexity to bleed into the library (and necessarily into
>> user code if given any power at all), I think you need to demonstrate a
>> tangible benefit that cannot be obtained by something simpler.  Consider
>> the simple and dumb threshold above to be the null hypothesis.
>> 
>> This is just my opinion.  Feel free to make a branch with whatever you
>> prefer.
>> 
>> --------
>> Barry -- Perhaps, and this is just nonsense off the top of my head, if you
>> had some measure of the importance of a vector (or matrix; I would
>> start with vectors for simplicity and since we have more of them)
>> based on how often its values would be "accessed". So a vector that
>> you know is only used "once in a while" gets a lower "importance"
>> than one that gets used "very often". Of course determining these
>> vectors' importances may be difficult. You could do it
>> experimentally: add some code that measures how often each vector
>> gets its values "accessed (whatever that means)"/read/written and see
>> if there is some distribution (do this for a nontrivial TS example)
>> where some vectors are accessed often and others rarely. 
>> 
>> Jed ---
>> This is what I termed profile-guided and it's very accurate (you have
>> global space-time information), but super brittle when
>> resource-constrained.
>> 
>> Note that in case of Krylov solvers, the first vectors in the Krylov
>> space are accessed far more than later vectors (e.g., the 30th vector is
>> accessed once per 30 iterations versus the first vector which is
>> accessed every iteration).  Simple greedy allocation is great for this
>> case.
>> 
>> It's terrible in other cases, a simple case of which is two solvers
>> where the first is cheap (or solved only rarely) and the second is
>> solved repeatedly at great expense.  Nested solvers are one such
>> example.  But you don't know which one is more expensive except in
>> retrospect, and this can even change as nonlinearities evolve.
>> 
>> --------
>> 
>> Jeff (Jeff Hammond <jeff.science at gmail.com>) --
>> The beauty of git/github is that one can make branches to try out
>> anything they want, even if Jed thinks that he knows better than Intel
>> how to write system software for Intel's hardware.
>> 
>> Jed ---
>> I'm objecting to the interface.  I think that if they try to get memkind
>> merged into the existing libnuma project, they'll see similar
>> resistance.  It is essential for low-level interfaces to create
>> foundations that can be reliably built upon, not gushing wounds that
>> bleed complexity into everything built on top.
>> 
>> 
>> 1. I cannot test it because I don't have access to the hardware.
>> 
>> 2. I think memkind is solving the wrong problem in the wrong way.
>> 
>> 3. According to Richard, the mature move_pages(2) interface has been
>> implemented.  That's what I wanted, so I'll just use that -- memkind
>> dependency gone.
>> 
>> ---------
>> 
>> Jeff --- The memkind library itself was developed entirely without access to
>> the hardware to which you refer, so this complaint is not relevant.
>> 
>> Jed ---- The interesting case here is testing failure modes in the face of
>> resource exhaustion, which doesn't seem to have been addressed in a
>> serious way by memkind and requires other trickery to test without
>> MCDRAM.  Also, the performance effects are relevant.  But I don't want
>> anything else in memkind because I don't want to use memkind for
>> anything ever.
>> 
>> 2. I think memkind is solving the wrong problem in the wrong way.
>> 
>> Jeff - It is more correct to say it is solving a different problem than the
>> one you care about.  memkind is the correct way to solve the problem
>> it is trying to solve.  Please stop equating your disagreement with
>> the problem statement as evidence that the solution is terrible.
>> 
>> Jed - This is pedantry.  Is there a clear statement of what problem memkind
>> solves?
>> 
>> Jeff -  The memkind library is a user extensible heap manager built on top of
>> jemalloc which enables control of memory characteristics and a
>> partitioning of the heap between kinds of memory.
>> 
>> Jed - This is just a low-level statement about what it does, and I would argue
>> it doesn't even do this in a useful way because it acts entirely at
>> allocation time, assuming the caller is omniscient.
>> 
>> Jeff - 3. According to Richard, the mature move_pages(2) interface has been
>> implemented.  That's what I wanted, so I'll just use that -- memkind
>> dependency gone.
>> 
>> Does this mean that you will stop complaining about memkind, since it
>> is not directly relevant to your life?  I would like that.
>> 
>> Jed -  Yes, as soon as people stop telling me that I should use memkind and
>> stop asking to put it into packages I interact with, I'll ignore it like
>> countless other projects that are irrelevant to what I do.  But if, like
>> OpenMP, the turd keeps landing in my breakfast, I'm probably going to
>> mention that it's impolite to keep pooping in my breakfast.
>> 
>> --------
>> 
>> Barry - It is OUR job as PETSc developers to hide that complexity from the
>> "most people" who would be driven away from HPC because of it. 
>> 
>> Jed - Absolutely.  So now the question becomes "what benefit can this have,
>> predicated on not letting the complexity bleed onto the user?"
>> 
>> Barry -  Thus if Richard proposed changing VecCreate() to VecCreate(MPI_Comm,
>> Crazy Intel specific Memkind options, Vec *x); we would reject
>> it. He is not even coming close to proposing that, in fact he is not
>> proposing anything, he is just asking for advice on how to run some
>> experiments to see if the Phi crazy memory shit can be beneficial to
>> some PETSc apps.
>> 
>> Jed - And my advice is to start with the simplest thing possible.
>> 
>> I'm also expressing skepticism that a more sophisticated solution that
>> _does not bleed complexity on the user_ is capable of substantially
>> beating the simple thing across a meaningful range of applications.
>> 
>> -----
>> 
>> Jeff - If the pattern is so trivial, then PETSc should be able to observe it and
>> memcpy pages between MCDRAM and DDR4.
>> 
>> Jed -
>> The difference is that memcpy changes the virtual address, which would
>> require non-local rewiring (in some cases).
>> 
>> Jeff - Your argument all along is that it is just too hard for PETSc to do
>> anything intelligent with user data, and yet you think Linux somehow does
>> better using only the VM context.
>> 
>> Jed - My argument was always that memory placement is not a *local* decision.
>> The memkind interface is static, so you either make a static local
>> decision or build some dynamicism around it.  But even if you build the
>> dynamicism, it's still a mess (at best) to collect the *non-local*
>> information needed to make an accurate decision.  Moreover, even with a
>> global view up to the present, but lacking clairvoyance (ability to
>> prove as-yet-unknown theorems), you cannot determine what memory is
>> "hottest".  Of course it's trivial in retrospect if you can profile _the
>> specific user configuration_.
>> 
>> -----
>> 
>> Jeff - Show me a profile-guided Linux page-migration implementation.
>> 
>> Jed - 
>> Automatic NUMA balancing has existed in Linux since 3.8, though the
>> algorithms have been improved over the years.  This figure shows it
>> working as well as manual tuning in the non-oversubscribed cases.
>> 
>> http://rhelblog.redhat.com/2015/01/12/mysteries-of-numa-memory-management-revealed/
>> 
>> My understanding of the current logic is that it assumes that moving
>> memory from one NUMA node to another involves moving it further away
>> from some cores.  So the algorithm may need tuning for KNL, but
>> non-oversubscribed long-running scientific workloads are a pretty easy
>> case.



