[petsc-dev] Adding support memkind allocators in PETSc

Jed Brown jed at jedbrown.org
Tue Apr 28 11:35:24 CDT 2015


Richard Mills <rtm at utk.edu> writes:

>> I'm at a loss for words to express how disgusting this is.
>>
>
> Ha ha!  Yeah, I don't like it either.  Chris and I were just thinking about
> what we could do if we wanted to not break the existing API.  

But it DOES BREAK THE EXISTING API!  If you make this change, ALL
EXISTING CODE IS BROKEN and yet broken in a way that the compiler cannot
warn about.  This is literally the worst possible thing.

>> What did Chris say when you asked him about making memkind "suck less"?
>> (Using shorthand to avoid retyping my previous long emails with
>> constructive suggestions.)
>>
>
> I had some pretty good discussions with Chris.  He's a very reasonable guy,
> actually (and unfortunately has just moved to another project, so someone
> else is going to have to take over memkind ownership).  I summarize the
> main points (the ones I can recall, anyway) below:
>
> 1) Easy one first: Regarding my wish for a call to accurately query the
> amount of available high-bandwidth memory (MCDRAM), there is currently a
> memkind_get_size() API but it has the shortcomings of being expensive and
> not taking into account the heap's free pool (just the memory that the OS
> knows to be available).  It should be possible to get around the expense of
> the call with some caching and to include the free pool accounting.  Don't
> know if any work has been done on this one, yet.

I don't think this is very useful for multi-process or threaded code
(i.e., all code that might run on KNL) due to race conditions.  Suppose
that 1% of processes get allocation kinds mixed up due to the race
condition and then run 5x slower for the memory-bound phases of the
application.  Have fun load balancing that.  If you want reproducible
performance and/or to avoid this load-balancing disaster, you need to
either solve the packing problem in a deterministic way or adaptively
modify the policy so that you can fix the low-quality allocations due
to race conditions.
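
For reference, the query under discussion looks roughly like the
following (a sketch; memkind_get_size() and MEMKIND_HBW are from the
memkind headers as I understand them, and per the caveats above the
"free" number reflects only what the OS knows, not the heap's free
pool):

  #include <stdio.h>
  #include <memkind.h>

  /* Sketch: query MCDRAM capacity via memkind_get_size().  As noted
     above, the call is expensive and the "free" figure does not
     account for the allocator's own free pool. */
  int report_hbw_size(void)
  {
    size_t total = 0, free_bytes = 0;
    int err = memkind_get_size(MEMKIND_HBW, &total, &free_bytes);
    if (err) return err;
    printf("MCDRAM: %zu bytes total, %zu bytes free\n", total, free_bytes);
    return 0;
  }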

> 2) Regarding the desire to be able to move pages between kinds of memory
> while keeping the same virtual address:  This is tough to implement in a
> way that will give decent performance.  I guess that what we'd really like
> to have would be an API like
>
>   int memkind_convert(memkind_t kind, void *ptr, size_t size);
>
> but the problem with the above is that if the physical backing of a
> virtual address is being changed, then a POSIX system call has to be made.

This interface is too fine-grained in my opinion.

> Linux provides the mbind(2) and move_pages(2) system calls that enable the
> user to modify the backing physical pages of virtual address ranges within
> the NUMA architecture, so these can be used to move physical pages between
> NUMA nodes (and high bandwidth on-package memory will be treated as a NUMA
> node).  (A user on a KNL system could actually use move_pages(2) to move
> between DRAM and MCDRAM, I believe.)  

Really?  That's what I'm asking for.
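
To be concrete, here is a sketch of the granularity I mean (untested;
it assumes buf is page-aligned and that MCDRAM is exposed as the given
NUMA node, which is system-dependent):

  #include <stdlib.h>
  #include <unistd.h>
  #include <numaif.h>   /* move_pages(2); link with -lnuma */

  /* Sketch: migrate the pages backing a (page-aligned) buffer to the
     given NUMA node -- e.g., the node MCDRAM shows up as -- while
     keeping the virtual addresses intact. */
  static long migrate_to_node(void *buf, size_t bytes, int node)
  {
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages   = (bytes + pagesize - 1) / pagesize;
    void **pages    = malloc(npages * sizeof(*pages));
    int   *nodes    = malloc(npages * sizeof(*nodes));
    int   *status   = malloc(npages * sizeof(*status));
    long   err      = -1;
    if (pages && nodes && status) {
      for (size_t i = 0; i < npages; i++) {
        pages[i] = (char *)buf + i * pagesize;
        nodes[i] = node;
      }
      /* pid 0 = current process; per-page results land in status[] */
      err = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
    }
    free(pages); free(nodes); free(status);
    return err;
  }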

> But Linux doesn't provide an equivalent way for a user to change the
> page size of the backing physical pages of an address range, so it's
> not possible to implement the above memkind_convert() with what Linux
> currently provides.

For small allocations, it doesn't matter where the memory is located
because it's either in cache or it's not.  From what I hear, KNL's
MCDRAM won't improve latency, so all such allocations may as well go in
DRAM anyway.  So all I care about are substantial allocations, like
matrix and vector data.  It's not expensive to allocate those aligned
with page boundaries (provided they are big enough; coarse grids don't
matter).
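
That is, something as simple as this (a sketch) suffices for the big
allocations:

  #include <stdlib.h>
  #include <unistd.h>

  /* Sketch: page-aligned allocation for large matrix/vector data, so
     the whole buffer can later be migrated at page granularity.
     Small allocations don't need this. */
  static void *alloc_big(size_t bytes)
  {
    void  *p = NULL;
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    if (posix_memalign(&p, pagesize, bytes)) return NULL;
    return p;
  }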

> If we want to move data from one memory kind to another, I believe that we
> need to be able to deal with the virtual address changing.  

That is a regression relative to move_pages.  Just make move_pages work.
That's the granularity I've been asking for all along.

> Yes, this is a pain because extra bookkeeping is involved.  Maybe we
> don't want to bother with supporting something like this in PETSc.
> But I don't know of any good way around this.  I have discussed with
> Chris the idea of adding support for asynchronously copying pages
> between different kinds of memory (maybe have a memdup() analog to
> strdup()) and he had some ideas about how this might be done
> efficiently.  But, again, I don't know of a good way to move data to a
> different memory kind while keeping the same virtual address.  If I'm
> misunderstanding something about what is possible with Linux (or other
> *nix), please let me know--I'd really like to be wrong on this.

Moving memory at page granularity is all you can do.  The hardware
doesn't support virtual-physical mapping at different granularity, so
there is no way to preserve address without affecting everything sharing
that page.  But "memkinds" only matter for large allocations.

Is it a showstopper to have different addresses and do full copies?
It's more of a mess with threads (requires extra
synchronization/coordination), but it's sometimes (maybe often)
feasible.  It's certainly ugly and a debugging nightmare (e.g., you'll
set a location watchpoint and not see where it was modified because it
was copied out to a different kind).  We'll also need a system for
eviction.
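
For the record, the address-changing full copy amounts to something
like this (a sketch; memkind_malloc/memkind_free are the real memkind
calls, but the helper name is made up, and the caller still has to
patch up every pointer to the data -- that's the bookkeeping pain):

  #include <string.h>
  #include <memkind.h>

  /* Sketch of the "different address, full copy" scheme: allocate in
     the destination kind, copy, free the original.  All the pain is
     in updating every reference to the old address. */
  static void *memkind_dup(memkind_t dst_kind, memkind_t src_kind,
                           void *ptr, size_t size)
  {
    void *newptr = memkind_malloc(dst_kind, size);
    if (!newptr) return NULL;
    memcpy(newptr, ptr, size);
    memkind_free(src_kind, ptr);
    return newptr;
  }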