[petsc-dev] Adding support memkind allocators in PETSc

Tue Apr 28 01:38:36 CDT 2015

On Mon, Apr 27, 2015 at 12:38 PM, Jed Brown <jed at jedbrown.org> wrote:

> Richard Mills <rtm at utk.edu> writes:
> > I think it is possible to add the memkind support without breaking all of
> > the interfaces used throughout PETSc for PetscMalloc(), etc.  I recently
> > sat with Chris Cantalupo, the main memkind developer, and walked him
> > through PETSc's allocation routines, and we came up with the following:
> > The imalloc() function pointer could have an implementation something like
> >
> > PetscErrorCode PetscMemkindMalloc(size_t size, const char *func,
> >                                   const char *file, void **result)
> >
> > {
> >
> >     struct memkind *kind;
> >
> >     int err;
> >
> >
> >
> >     if (*result == NULL) {
> >
> >         kind = MEMKIND_DEFAULT;
> >
> >     }
> >
> >     else {
> >
> >         kind = (struct memkind *)(*result);
>
> I'm at a loss for words to express how disgusting this is.
>

Ha ha!  Yeah, I don't like it either.  Chris and I were just thinking about
what we could do if we wanted to not break the existing API.  But one of my
favorite things about PETSc is that developers are never afraid to make
wholesale changes to things.

>
> > This gives us (1) a method of passing the kind of memory without modifying
> > the petsc allocation routine calling sequence,
>
> Nonsense, it just dodges the compiler's ability to tell you about the
> memory errors that it creates at every place where PetscMalloc is
> called!
>
>
> What did Chris say when you asked him about making memkind "suck less"?
> (Using shorthand to avoid retyping my previous long emails with
> constructive suggestions.)
>

I had some pretty good discussions with Chris.  He's a very reasonable guy,
actually (and unfortunately has just moved to another project, so someone
else is going to have to take over memkind ownership).  I summarize the
main points (the ones I can recall, anyway) below:

1) Easy one first: Regarding my wish for a call to accurately query the
amount of available high-bandwidth memory (MCDRAM), there is currently a
memkind_get_size() API but it has the shortcomings of being expensive and
not taking into account the heap's free pool (just the memory that the OS
knows to be available).  It should be possible to get around the expense of
the call with some caching and to include the free pool accounting.  Don't
know if any work has been done on this one, yet.

2) Regarding the desire to be able to move pages between kinds of memory
while keeping the same virtual address:  This is tough to implement in a
way that will give decent performance.  I guess that what we'd really like
to have would be an API like

  int memkind_convert(memkind_t kind, void *ptr, size_t size);

but the problem with the above is that if the physical backing of a
virtual address is being changed, then a POSIX system call has to be made.
This also means that a heap management system tracking properties of
virtual address ranges for reuse after freeing will require *making a
system call to query the properties at the time of the free*.  This kills a
lot of the reason for using a heap manager in the first place: avoiding the
expense of repeated system calls (otherwise we'd just use mmap() for
everything) by reusing memory already obtained from the kernel.

Linux provides the mbind(2) and move_pages(2) system calls that enable the
user to modify the backing physical pages of virtual address ranges within
the NUMA architecture, so these can be used to move physical pages between
NUMA nodes (and high bandwidth on-package memory will be treated as a NUMA
node).  (A user on a KNL system could actually use move_pages(2) to move
between DRAM and MCDRAM, I believe.)  But Linux doesn't provide an
equivalent way for a user to change the page size of the backing physical
pages of an address range, so it's not possible to implement the above
memkind_convert() with what Linux currently provides.

If we want to move data from one memory kind to another, I believe that we
need to be able to deal with the virtual address changing.  Yes, this is a
pain because extra bookkeeping is involved.  Maybe we don't want to bother
with supporting something like this in PETSc.  But I don't know of any good
way around this.  I have discussed with Chris the idea of adding support
for asynchronously copying pages between different kinds of memory (maybe
have a memdup() analog to strdup()) and he had some ideas about how this
might be done efficiently.  But, again, I don't know of a good way to move
data to a different memory kind while keeping the same virtual address.  If
I'm misunderstanding something about what is possible with Linux (or other
*nix), please let me know--I'd really like to be wrong on this.
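
For concreteness, a memdup()-style move between kinds might look roughly like the sketch below. Everything in it is an assumption made for illustration: mem_kind_ops, memdup_kind(), and mem_migrate() are hypothetical names, not part of memkind or PETSc, and both "kinds" are simply backed by malloc()/free(). The point is only that the caller's pointer has to be rewritten because the virtual address changes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator interface standing in for a "kind" of memory.
 * A real version would wrap the allocation routines of the library that
 * manages each kind; here these are plain function pointers. */
typedef struct {
    void *(*alloc)(size_t size);
    void (*release)(void *ptr);
} mem_kind_ops;

/* memdup() analog: copy n bytes into a fresh buffer taken from dst_kind.
 * Returns NULL on allocation failure; the source is left untouched. */
void *memdup_kind(const mem_kind_ops *dst_kind, const void *src, size_t n)
{
    void *dst = dst_kind->alloc(n);
    if (dst != NULL) memcpy(dst, src, n);
    return dst;
}

/* Move a buffer to another kind: duplicate it, free the old copy with the
 * allocator it came from, and update the caller's pointer. The virtual
 * address changes, so *ptr must be the only live reference to the data. */
int mem_migrate(const mem_kind_ops *src_kind, const mem_kind_ops *dst_kind,
                void **ptr, size_t n)
{
    void *copy = memdup_kind(dst_kind, *ptr, n);
    if (copy == NULL) return -1;   /* old buffer is still valid */
    src_kind->release(*ptr);
    *ptr = copy;
    return 0;
}

/* Assumption: both kinds map to malloc/free in this sketch. */
const mem_kind_ops KIND_DEFAULT_SKETCH = { malloc, free };
const mem_kind_ops KIND_HBW_SKETCH     = { malloc, free };
```

The "only live reference" requirement in mem_migrate() is exactly the extra bookkeeping mentioned above: any other copy of the old address becomes dangling after the move.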

Say that a library is eventually made available that can process all of the
nonlocal information to make reasonable recommendations about where various
data structures should be placed (or, hell, say that there is just an
oracle we can consult about this), but there isn't a good way to do this
while keeping the same virtual address.  Would this be a showstopper for
using it in PETSc?  If not, how should we deal with it?  In my toy MMLIB
("memory malleability library") code I wrote during my dissertation work to
handle "caching" data from disk in DRAM (for doing "memory adaptive"
in-core/out-of-core computations), I broke a given data set down into
"panels" of some user-determined granularity.  A particular data set was
associated with an MMS object (and there was a registry that tracked all of
the various MMSes), and when a user needed to work with a portion of the
data set, he would call

  void *mmlib_get_panel(MMS mms, int p)

to get a pointer to the beginning of panel p, work with it a while, and
then, when it could be safely released, would call

  void *mmlib_release_panel(MMS mms, int p)

Then the library would be free to evict the panel if necessary.  If it kept
it cached, a subsequent request for the panel would return the same
address, but if it was evicted and then later requested, another mmap() would
be performed to get at the data and a different address would be returned.
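
The get/release/evict protocol described above can be sketched as follows. This is not the MMLIB code: panel_set, panel_get(), panel_release(), and panel_evict() are hypothetical names, an in-memory array stands in for the file on disk, and malloc()/memcpy() stand in for mmap(). It only illustrates the pinning rule, i.e. that a panel cannot be evicted while a get is outstanding.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NPANELS 4
#define PANEL_BYTES 64

typedef struct {
    unsigned char backing[NPANELS][PANEL_BYTES]; /* stand-in for data on disk */
    void *cached[NPANELS];                       /* NULL when not resident */
    int pins[NPANELS];                           /* gets not yet released */
} panel_set;

void panel_set_init(panel_set *s)
{
    memset(s->backing, 0, sizeof s->backing);
    for (int p = 0; p < NPANELS; p++) {
        s->cached[p] = NULL;
        s->pins[p] = 0;
    }
}

/* Return a pointer to panel p, loading it if it is not resident.
 * The panel stays pinned until the matching panel_release(). */
void *panel_get(panel_set *s, int p)
{
    if (s->cached[p] == NULL) {
        s->cached[p] = malloc(PANEL_BYTES);
        if (s->cached[p] == NULL) return NULL;
        memcpy(s->cached[p], s->backing[p], PANEL_BYTES);
    }
    s->pins[p]++;
    return s->cached[p];
}

/* Drop one pin; the panel stays cached until someone evicts it. */
void panel_release(panel_set *s, int p)
{
    assert(s->pins[p] > 0);
    s->pins[p]--;
}

/* Write the panel back and free it. Returns 1 if evicted, 0 if the panel
 * is pinned or not resident. After eviction, old pointers are invalid. */
int panel_evict(panel_set *s, int p)
{
    if (s->cached[p] == NULL || s->pins[p] > 0) return 0;
    memcpy(s->backing[p], s->cached[p], PANEL_BYTES);
    free(s->cached[p]);
    s->cached[p] = NULL;
    return 1;
}
```

While a panel stays cached, repeated panel_get() calls return the same address; after an eviction, the next panel_get() allocates again and the caller must not assume the address is unchanged.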

My MMLIB library was really just a toy; the examples I looked at were
pretty contrived; and the "panel" is perhaps the wrong granularity, but is
an approach along these lines unworkable?  Or, rather: If we have to, how
horrible would it be to need a pointer to a pointer inside, say, a Vec to
get to the actual array of values?  If the array of values is being accessed
via VecGetArray(), the address of this array is not allowed to be changed
based on our oracle's recommendations until VecRestoreArray() is called.
If there is no outstanding VecGetArray(), then our oracle is free to change
the address that the array actually "lives" at, and the next VecGetArray()
might return a different address.  Can we deal with this, or are there
terrible complications I'm not thinking of?  It *may* be possible to have
some systems where we can move a data structure around through all kinds of
memory and keep the same virtual addresses, but I think that there will
certainly be systems on which this will NOT be possible, and I think this
sort of consideration will become more common as more companies introduce
different kinds of high-bandwidth memory, types of NVRAM, etc.  Yes, the
proliferation of various kinds of user-addressable memory types is horrible
from a certain perspective, but I don't think it can be avoided.
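
As a rough illustration of the get/restore rule for a relocatable array, here is a minimal sketch. It does not use PETSc: SketchVec and the sketch_vec_*() functions are hypothetical stand-ins for a Vec, VecGetArray(), and VecRestoreArray(), and sketch_vec_relocate() plays the part of the oracle deciding to move the data (here just to a fresh malloc() buffer).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    double *array;    /* current location; may change between get/restore */
    size_t n;
    int outstanding;  /* get calls that have not been restored yet */
} SketchVec;

int sketch_vec_create(SketchVec *v, size_t n)
{
    v->array = calloc(n, sizeof *v->array);
    v->n = n;
    v->outstanding = 0;
    return v->array ? 0 : -1;
}

/* Hand out the current address and pin it until the matching restore. */
double *sketch_vec_get_array(SketchVec *v)
{
    v->outstanding++;
    return v->array;
}

/* Unpin and null the caller's pointer, since it may go stale after this. */
void sketch_vec_restore_array(SketchVec *v, double **a)
{
    assert(v->outstanding > 0 && *a == v->array);
    v->outstanding--;
    *a = NULL;
}

/* Move the values to a new buffer. Returns 0 on success, 1 if refused
 * because a get is outstanding, -1 on allocation failure. */
int sketch_vec_relocate(SketchVec *v)
{
    if (v->outstanding > 0) return 1;
    double *moved = malloc(v->n * sizeof *moved);
    if (moved == NULL) return -1;
    memcpy(moved, v->array, v->n * sizeof *moved);
    free(v->array);
    v->array = moved;
    return 0;
}

void sketch_vec_destroy(SketchVec *v)
{
    free(v->array);
    v->array = NULL;
}
```

The cost of this design is one extra indirection per access and the requirement that no caller holds on to the raw pointer across a restore; nulling the pointer in the restore call is one way to make violations fail loudly.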

> > and (2) support a fallback code path for legacy applications which will
> > not set the pointer to NULL.  Or am I missing something?
>
>