[petsc-dev] Adding support memkind allocators in PETSc

Tue Apr 28 23:54:33 CDT 2015

On Tue, Apr 28, 2015 at 9:35 AM, Jed Brown <jed at jedbrown.org> wrote:

> Richard Mills <rtm at utk.edu> writes:

[...]
>
> > Linux provides the mbind(2) and move_pages(2) system calls that enable the
> > user to modify the backing physical pages of virtual address ranges within
> > the NUMA architecture, so these can be used to move physical pages between
> > NUMA nodes (and high bandwidth on-package memory will be treated as a NUMA
> > node).  (A user on a KNL system could actually use move_pages(2) to move
> > between DRAM and MCDRAM, I believe.)
>
> Really?  That's what I'm asking for.
>

Yes, I am ~ 99% sure that this is the case, but I will double-check to make
sure.
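
To make concrete what I mean, here is a rough sketch of migrating the pages
behind an existing buffer with move_pages(2).  The helper name
sketch_move_range() and the 1024-page cap are made up for illustration, and I
call the raw system call so that the sketch does not assume libnuma is
installed.  Which NUMA node number corresponds to MCDRAM on a given KNL
system is something I am not assuming here; the caller passes it in.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Numeric value of MPOL_MF_MOVE from <linux/mempolicy.h>, repeated here so
 * the sketch does not depend on libnuma's <numaif.h> being installed. */
#define SKETCH_MPOL_MF_MOVE 2

enum { SKETCH_MAX_PAGES = 1024 }; /* arbitrary cap for this sketch */

/* Hypothetical helper: ask the kernel to migrate the pages backing
 * [addr, addr + len) to NUMA node target_node.  The virtual addresses stay
 * the same; only the backing physical pages may change.
 * Returns the number of pages reported on target_node afterwards, or -1 if
 * the range is too large for this sketch or the system call failed (for
 * example, no NUMA support or the call is not permitted). */
long sketch_move_range(void *addr, size_t len, int target_node)
{
    assert(len > 0);

    void *pages[SKETCH_MAX_PAGES];
    int nodes[SKETCH_MAX_PAGES];
    int status[SKETCH_MAX_PAGES];
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t first = (uintptr_t)addr & ~(page_size - 1);
    uintptr_t last = ((uintptr_t)addr + len - 1) & ~(page_size - 1);
    size_t count = (size_t)((last - first) / page_size) + 1;
    long on_target = 0;

    if (count > SKETCH_MAX_PAGES)
        return -1;

    for (size_t i = 0; i < count; i++) {
        pages[i] = (void *)(first + i * page_size);
        nodes[i] = target_node;
        status[i] = -1;
    }

    /* A pid of 0 means "the calling process". */
    if (syscall(SYS_move_pages, 0L, (unsigned long)count, pages, nodes,
                status, (long)SKETCH_MPOL_MF_MOVE) < 0)
        return -1;

    /* On return, status[i] holds the node of page i or a negative errno
     * (e.g. for a page that has never been touched). */
    for (size_t i = 0; i < count; i++)
        if (status[i] == target_node)
            on_target++;

    return on_target;
}
```

The point is that the pointer the application holds is never touched, which
is what makes this attractive compared with copying to a new allocation.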

>
> > But Linux doesn't provide an equivalent way for a user to change the
> > page size of the backing physical pages of an address range, so it's
> > not possible to implement the above memkind_convert() with what Linux
> > currently provides.
>
> For small allocations, it doesn't matter where the memory is located
> because it's either in cache or it's not.  From what I hear, KNL's
> MCDRAM won't improve latency, so all such allocations may as well go in
> DRAM anyway.  So all I care about are substantial allocations, like
> matrix and vector data.  It's not expensive to allocate those to align
> with page boundaries (provided they are big enough; coarse grids don't
> matter).
>

Yes, MCDRAM won't help with latency, only bandwidth, so for small
allocations it won't matter.  Following reasoning like what you have above,
a colleague on my team recently developed an "AutoHBW" tool for users who
don't want to modify their code at all.  A user can specify a size
threshold above which allocations should come from MCDRAM, and then the
tool interposes on the malloc() (or other allocator) calls to put the small
stuff in DRAM and the big stuff in MCDRAM.
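
The size-threshold idea can be sketched roughly as below.  This is not the
AutoHBW code and it does not interpose on malloc(); it is an explicit wrapper
with made-up names (sketch_malloc, sketch_free, sketch_threshold) that only
shows the routing decision.  The "high-bandwidth" allocator is a function
pointer that defaults to plain malloc() so the sketch builds anywhere; on a
system with memkind one might point it at hbw_malloc()/hbw_free() from
<hbwmalloc.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum { SKETCH_DRAM = 0, SKETCH_HBW = 1 } sketch_kind;

/* Small header stored in front of each block so that free() knows which
 * allocator owns it.  The union keeps the user data maximally aligned. */
typedef union {
    sketch_kind kind;
    max_align_t align;
} sketch_header;

/* Allocations of at least this many bytes go to the HBW allocator.
 * 1 MiB is an arbitrary value chosen for the sketch. */
static size_t sketch_threshold = (size_t)1 << 20;

/* Stand-ins for a high-bandwidth allocator (assumption: plain malloc/free
 * here; memkind's hbw_malloc/hbw_free would be the natural replacement). */
void *(*sketch_hbw_alloc)(size_t) = malloc;
void (*sketch_hbw_free)(void *) = free;

void *sketch_malloc(size_t n)
{
    if (n > SIZE_MAX - sizeof(sketch_header))
        return NULL; /* size computation would overflow */

    sketch_kind kind = (n >= sketch_threshold) ? SKETCH_HBW : SKETCH_DRAM;
    sketch_header *h = (kind == SKETCH_HBW)
                           ? sketch_hbw_alloc(sizeof *h + n)
                           : malloc(sizeof *h + n);
    if (!h)
        return NULL;
    h->kind = kind;
    return h + 1; /* user data starts right after the header */
}

/* Report which allocator a block from sketch_malloc() came from. */
sketch_kind sketch_kind_of(const void *p)
{
    assert(p != NULL);
    return ((const sketch_header *)p - 1)->kind;
}

void sketch_free(void *p)
{
    if (!p)
        return;
    sketch_header *h = (sketch_header *)p - 1;
    if (h->kind == SKETCH_HBW)
        sketch_hbw_free(h);
    else
        free(h);
}
```

With the default threshold, sketch_malloc(64) is tagged SKETCH_DRAM and a
2 MiB request is tagged SKETCH_HBW.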

I think many users are going to want more control than what something like
AutoHBW provides, but, as you say, a lot of the time one will only care
about the substantial allocations for things like matrices and vectors,
and these also tend to be long lived--plenty of codes will do something
like allocate a matrix for Jacobians once and keep it around for the
lifetime of the run.  Maybe we should consider not using a heap manager for
these allocations, then.  For allocations above some specified threshold,
perhaps we (PETSc) should simply do the appropriate mmap() and mbind()
calls to allocate the pages we need in the desired type of memory, and then
we could use things like move_pages() if/when appropriate (yes, I know
we don't yet have a good way to make such decisions).  This would mean
PETSc getting more into the lower-level details of memory management, but
maybe this is appropriate (and unavoidable) as more kinds of
user-addressable memory get introduced.  I think this is actually less horrible
than it sounds, because, really, we would just want to do this for the
largest allocations.  (And this is somewhat analogous to how many malloc()
implementations work, anyway: Use sbrk() for the small stuff, and mmap()
for the big stuff.)
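
As a rough sketch of the mmap() plus mbind() approach for a large allocation
(not proposed PETSc code; the names sketch_alloc_on_node and the choice of
MPOL_PREFERRED over MPOL_BIND are my assumptions for illustration):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Numeric value of MPOL_PREFERRED from <linux/mempolicy.h>, repeated here
 * so the sketch does not depend on libnuma's <numaif.h>. */
#define SKETCH_MPOL_PREFERRED 1

/* Hypothetical helper: map len bytes of anonymous memory and ask the kernel
 * to prefer NUMA node `node` when the pages are first touched.
 * Returns the mapping (page aligned by construction) or NULL if mmap()
 * failed.  If `bound` is non-NULL it is set to 1 when mbind() succeeded and
 * to 0 when it did not (e.g. no NUMA support, or the call is not permitted),
 * in which case the memory is still usable with the default policy. */
void *sketch_alloc_on_node(size_t len, int node, int *bound)
{
    unsigned long nodemask;

    /* Keep the node inside a single-word mask, with one bit of headroom. */
    assert(node >= 0 && node < (int)(8 * sizeof nodemask) - 1);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    nodemask = 1UL << node;
    long rc = syscall(SYS_mbind, p, (unsigned long)len,
                      (long)SKETCH_MPOL_PREFERRED, &nodemask,
                      (unsigned long)(8 * sizeof nodemask), 0UL);
    if (bound)
        *bound = (rc == 0);
    return p;
}

/* Release a region obtained from sketch_alloc_on_node(). */
int sketch_free_region(void *p, size_t len)
{
    return munmap(p, len);
}
```

I use MPOL_PREFERRED here so that the allocation falls back to other nodes
when the preferred one is full; MPOL_BIND would be the stricter choice.
Because the region is a whole number of pages that we own, it is also a
natural unit to hand to move_pages() later.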

>
> > If we want to move data from one memory kind to another, I believe that we
> > need to be able to deal with the virtual address changing.
>
> That is a regression relative to move_pages.  Just make move_pages work.
> That's the granularity I've been asking for all along.
>

That cannot practically be done using a heap manager system like memkind.  But
we can do this if we do our own mmap() calls, as discussed above.

>
> > Yes, this is a pain because extra bookkeeping is involved.  Maybe we
> > don't want to bother with supporting something like this in PETSc.
> > But I don't know of any good way around this.  I have discussed with
> > Chris the idea of adding support for asynchronously copying pages
> > between different kinds of memory (maybe have a memdup() analog to
> > strdup()) and he had some ideas about how this might be done
> > efficiently.  But, again, I don't know of a good way to move data to a
> > different memory kind while keeping the same virtual address.  If I'm
> > misunderstanding something about what is possible with Linux (or other
> > *nix), please let me know--I'd really like to be wrong on this.
>
> Moving memory at page granularity is all you can do.  The hardware
> doesn't support virtual-physical mapping at different granularity, so
> there is no way to preserve address without affecting everything sharing
> that page.  But "memkinds" only matter for large allocations.
>
> Is it a showstopper to have different addresses and do full copies?
> It's more of a mess with threads (requires extra
> synchronization/coordination), but it's sometimes (maybe often)
> feasible.  It's certainly ugly and a debugging nightmare (e.g., you'll
> set a location watchpoint and not see where it was modified because it
> was copied out to a different kind).  We'll also need a system for
> eviction.
>
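
For the full-copy option, where the virtual address does change, the basic
operation might look something like the sketch below.  The name
sketch_migrate and the function-pointer interface are made up for
illustration; the allocator pair for each memory kind is supplied by the
caller, and nothing here addresses the thread coordination or eviction
issues you mention.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef void *(*sketch_alloc_fn)(size_t);
typedef void (*sketch_free_fn)(void *);

/* Hypothetical helper: copy the n bytes at *buf into memory obtained from
 * dst_alloc (the allocator for the destination memory kind), release the
 * old block with src_free, and update *buf to the new address.
 * Returns 0 on success, or -1 if the new allocation failed, in which case
 * *buf and its contents are left untouched.
 * Every other pointer into the old block becomes stale; tracking those is
 * the extra bookkeeping discussed above. */
int sketch_migrate(void **buf, size_t n,
                   sketch_alloc_fn dst_alloc, sketch_free_fn src_free)
{
    assert(buf != NULL && *buf != NULL);

    void *fresh = dst_alloc(n);
    if (!fresh)
        return -1;

    memcpy(fresh, *buf, n);
    src_free(*buf);
    *buf = fresh;
    return 0;
}
```

A caller that holds its data through a single handle (as a Vec or Mat does
for its array) would only need that one handle updated.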