[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory

Jed Brown jedbrown at mcs.anl.gov
Sat Oct 6 22:59:31 CDT 2012


On Sat, Oct 6, 2012 at 10:22 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

> It could be that this "single virtually linear piece of memory" is so
> tempting and so delicious that it fogs our brains and prevents us from
> coming up with a better, more comprehensive model that is more amenable
> to heterogeneous computing.
>
>      The single virtual linear piece of memory model allows us to address
> "local" entries and "ghost" entries needed in the computation using the
> same indexing WITHOUT having to pack ghost regions (note Jed discussed
> this in a just-received email). With distributed memory one seemingly
> must pack ghost regions. I'd like the same "kernel" model in both cases
> while still always getting good performance.  So one could have
>
> 1) announce in advance the needed ghost points (maybe a no-op, or some
> prefetching, on single-virtual-linear-memory hardware; do the VecScatter
> in PETSc in distributed memory; do something with multiple GPUs)
> 2) do the computations, perhaps first the part that does not need ghost
> points, then the part that does
>
>    Note that if we were real OS hackers we could play with the virtual
> address space directly so that the packed ghost point region (which may
> be physical memory NOT next to the "local" part of the vector) is
> addressed in a unified address space with the "local" part. For example,
> trying to address part of the "ghost region to be sent from another
> processor" prematurely (that is, after it was requested from other nodes
> but before it arrived) could simply result in a blocked memory operation,
> thus allowing Jed's virtual linear piece of memory model in the
> distributed world, hence the same kernel C code for both multithread and
> MPI, always with the best performance.
>

The last time we discussed this, I thought we came to the conclusion that
this would leave lots of holes in the memory space, leading to very poor
memory utilization.
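
For what it's worth, steps 1) and 2) are roughly what the existing
ghosted-Vec interface already expresses, without any virtual-memory
tricks. A minimal, untested sketch, assuming x was created with
VecCreateGhost(); InteriorKernel/BoundaryKernel are hypothetical
placeholders for the application kernels:

#include <petscvec.h>

PetscErrorCode ResidualWithOverlap(Vec x,Vec f)
{
  PetscErrorCode ierr;
  Vec            xlocal;

  PetscFunctionBegin;
  /* 1) announce the needed ghost points; the scatter proceeds in the background */
  ierr = VecGhostUpdateBegin(x,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);

  /* 2a) compute the part of f that needs no ghost points, overlapping the communication */
  /* InteriorKernel(x,f); */

  /* 2b) wait for the ghost values, then compute the part that needs them */
  ierr = VecGhostUpdateEnd(x,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecGhostGetLocalForm(x,&xlocal);CHKERRQ(ierr);
  /* BoundaryKernel(xlocal,f); owned and ghost entries sit in one contiguous array */
  ierr = VecGhostRestoreLocalForm(x,&xlocal);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

In the local form the owned entries are followed by the ghost entries in
one contiguous array, so the kernel indexing looks the same whether the
ghost values arrived over MPI or were already resident in shared memory.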


>
>     BTW: do GPUs use the usual "virtual memory addressing" for the big
> GPU memory, or is it directly hardware addressed? (It sounds like, at
> least when the GPU and CPU have "unified memory", it must be virtual?)
>

I thought it was just offset-mapped. Note that GPU "shared memory" is
private to a thread block and not shared between SMs.


>      Virtual addressing gives us one level of indirection "for free" (and
> we always have to pay for it even if we ignore it), so why not try to take
> advantage of it?
>

Some caveats:

1. You only get virtual address indirection at page granularity. Linux now
uses "transparent huge pages", so you can have 2 MB or 4 MB pages. That is
too coarse for vector partitioning. Actually, if we're going to partition
vectors across NUMA domains, we may have to ask users to disable
transparent huge pages, then use libhugetlbfs
(http://lwn.net/Articles/375096/) to allocate with huge pages where it
makes sense (e.g., matrices); a rough sketch follows after item 2. Note
that using smaller pages gives up a lot of TLB reach:

   cache and TLB information (2):
      0x5a: data TLB: 2M/4M pages, 4-way, 32 entries
      0x03: data TLB: 4K pages, 4-way, 64 entries

2. Systems like Blue Gene do not have virtual memory in the usual sense.
Addresses are just offset-mapped. I think this model is unsustainable for
NUMA and I'm skeptical that IBM's next machine after /Q will preserve
uniform access to DRAM.
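
To make the page-granularity point in 1. concrete, here is a rough,
untested, Linux-specific sketch of the split described there (the function
names are placeholders, not PETSc routines; libhugetlbfs wraps essentially
the same mechanism): keep vectors on ordinary small pages so first-touch
placement works per NUMA domain, and put large matrices on explicitly
reserved huge pages so they cover the same memory with far fewer TLB
entries.

#define _GNU_SOURCE           /* for MAP_ANONYMOUS, MADV_*, MAP_HUGETLB */
#include <stddef.h>
#include <sys/mman.h>

/* Ordinary 4K pages: fine-grained first-touch placement per NUMA domain. */
static void *alloc_vector(size_t bytes)
{
  void *p = mmap(NULL,bytes,PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
  if (p == MAP_FAILED) return NULL;
#ifdef MADV_NOHUGEPAGE
  madvise(p,bytes,MADV_NOHUGEPAGE); /* opt this range out of transparent huge pages */
#endif
  return p;
}

/* Huge pages for large, long-lived matrix storage; 'bytes' should be a
   multiple of the huge page size (e.g. 2 MB). */
static void *alloc_matrix(size_t bytes)
{
#ifdef MAP_HUGETLB
  void *p = mmap(NULL,bytes,PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB,-1,0);
  if (p != MAP_FAILED) return p;
#endif
  return alloc_vector(bytes);       /* fall back to ordinary pages */
}

MAP_HUGETLB needs huge pages reserved ahead of time (vm.nr_hugepages),
which is part of why this would be an opt-in for matrices rather than the
default.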

