[petsc-dev] Adding support memkind allocators in PETSc

Tue Apr 28 07:24:36 CDT 2015

  PetscObject x; It would be problematic if the address x ever changed because copies of that address could be stored all over the place (as references to that object). But for the data within an object, such as the array of numerical values in a vector or matrix, or the indices in an IS, etc., there is generally only a single copy of that address, so (except when a Get is outstanding) at least in theory that memory can be swapped around without affecting the user (ahh the power of abstraction :-).  You could write some very simple test code such as a VecChangeMemory() that allocates new array space and copies the values over, or you can even do as we do with GPUs and have multiple array spaces allocated (in different kinds) and have VecGetArray(), depending on "something", return pointers to different ones.

  Barry

> On Apr 28, 2015, at 1:38 AM, Richard Mills <rtm at utk.edu> wrote:
> 
> On Mon, Apr 27, 2015 at 12:38 PM, Jed Brown <jed at jedbrown.org> wrote:
> Richard Mills <rtm at utk.edu> writes:
> > I think it is possible to add the memkind support without breaking all of
> > the interfaces used throughout PETSc for PetscMalloc(), etc.  I recently
> > sat with Chris Cantalupo, the main memkind developer, and walked him
> > through PETSc's allocation routines, and we came up with the following: The
> > imalloc() function pointer could have an implementation something like
> >
> > PetscErrorCode PetscMemkindMalloc(size_t size, const char *func, const char
> > *file, void **result)
> >
> > {
> >
> >     struct memkind *kind;
> >
> >     int err;
> >
> >
> >
> >     if (*result == NULL) {
> >
> >         kind = MEMKIND_DEFAULT;
> >
> >     }
> >
> >     else {
> >
> >         kind = (struct memkind *)(*result);
> 
> I'm at a loss for words to express how disgusting this is.
> 
> Ha ha!  Yeah, I don't like it either.  Chris and I were just thinking about what we could do if we wanted to not break the existing API.  But one of my favorite things about PETSc is that developers are never afraid to make wholesale changes to things.
>  
> 
> > This gives us (1) a method of passing the kind of memory without modifying
> > the petsc allocation routine calling sequence,
> 
> Nonsense, it just dodges the compiler's ability to tell you about the
> memory errors that it creates at every place where PetscMalloc is
> called!
> 
> 
> What did Chris say when you asked him about making memkind "suck less"?
> (Using shorthand to avoid retyping my previous long emails with
> constructive suggestions.)
>  
> I had some pretty good discussions with Chris.  He's a very reasonable guy, actually (and unfortunately has just moved to another project, so someone else is going to have to take over memkind ownership).  I summarize the main points (the ones I can recall, anyway) below:
> 
> 1) Easy one first: Regarding my wish for a call to accurately query the amount of available high-bandwidth memory (MCDRAM), there is currently a memkind_get_size() API but it has the shortcomings of being expensive and not taking into account the heap's free pool (just the memory that the OS knows to be available).  It should be possible to get around the expense of the call with some caching and to include the free pool accounting.  Don't know if any work has been done on this one, yet.
> 
> 2) Regarding the desire to be able to move pages between kinds of memory while keeping the same virtual address:  This is tough to implement in a way that will give decent performance.  I guess that what we'd really like to have would be an API like
> 
>   int memkind_convert(memkind_t kind, void *ptr, size_t size);
> 
> but the problem with the above is that if the physical backing of a virtual address is being changed, then a POSIX system call has to be made.  This also means that a heap management system tracking properties of virtual address ranges for reuse after freeing will require *making a system call to query the properties at the time of the free*.  This kills a lot of the reason for using a heap manager in the first place: avoiding the expense of repeated system calls (otherwise we'd just use mmap() for everything) by reusing memory already obtained from the kernel.
> 
> Linux provides the mbind(2) and move_pages(2) system calls that enable the user to modify the backing physical pages of virtual address ranges within the NUMA architecture, so these can be used to move physical pages between NUMA nodes (and high bandwidth on-package memory will be treated as a NUMA node).  (A user on a KNL system could actually use move_pages(2) to move between DRAM and MCDRAM, I believe.)  But Linux doesn't provide an equivalent way for a user to change the page size of the backing physical pages of an address range, so it's not possible to implement the above memkind_convert() with what Linux currently provides.
> 
> If we want to move data from one memory kind to another, I believe that we need to be able to deal with the virtual address changing.  Yes, this is a pain because extra bookkeeping is involved.  Maybe we don't want to bother with supporting something like this in PETSc.  But I don't know of any good way around this.  I have discussed with Chris the idea of adding support for asynchronously copying pages between different kinds of memory (maybe have a memdup() analog to strdup()) and he had some ideas about how this might be done efficiently.  But, again, I don't know of a good way to move data to a different memory kind while keeping the same virtual address.  If I'm misunderstanding something about what is possible with Linux (or other *nix), please let me know--I'd really like to be wrong on this.
> 
> Say that a library is eventually made available that can process all of the nonlocal information to make reasonable recommendations about where various data structures should be placed (or, hell, say that there is just an oracle we can consult about this), but there isn't a good way to do this while keeping the same virtual address.  Would this be a showstopper for using it in PETSc?  If not, how should we deal with it?  In my toy MMLIB ("memory malleability library") code I wrote during my dissertation work to handle "caching" data from disk in DRAM (for doing "memory adaptive" in-core/out-of-core computations), I broke a given data set down into "panels" of some user-determined granularity.  A particular data set was associated with an MMS object (and there was a registry that tracked all of the various MMSes), and when a user needed to work with a portion of the data set, he would call
> 
>   void *mmlib_get_panel(MMS mms, int p)
> 
> to get a pointer to the beginning of panel p, work with it a while, and then, when it could be safely released, would call
> 
>   void *mmlib_release_panel(MMS, int p)
> 
> Then the library would be free to evict the panel if necessary.  If it kept it cached, a subsequent request for the panel would return the same address, but if it was evicted and then later requested, another mmap() would be performed to get at the data and a different address would be returned.
> 
> My MMLIB library was really just a toy; the examples I looked at were pretty contrived; and the "panel" is perhaps the wrong granularity, but is an approach along these lines unworkable?  Or, rather: If we have to, how horrible would it be to need a pointer to a pointer inside, say, a Vec to get to the actual array of values?  If the array of values is being accessed via VecGetArray(), the address of this array is not allowed to be changed based on our oracle's recommendations until VecRestoreArray() is called.  If there is no outstanding VecGetArray(), then our oracle is free to change the address that the array actually "lives" at, and the next VecGetArray() might return a different address.  Can we deal with this, or are there terrible complications I'm not thinking of?  It *may* be possible to have some systems where we can move a data structure around through all kinds of memory and keep the same virtual addresses, but I think that there will certainly be systems on which this will NOT be possible, and I think this sort of consideration will become more common as more companies introduce different kinds of high-bandwidth memory, types of NVRAM, etc.  Yes, the proliferation of various kinds of user-addressable memory types is horrible from a certain perspective, but I don't think it can be avoided.
> 
> 
> > and (2) support for a fallback code path for legacy applications which will
> > not set the pointer to NULL.  Or am I missing something?
> 
>