On Sat, Oct 6, 2012 at 9:16 PM, Karl Rupp <span dir="ltr"><<a href="mailto:rupp@mcs.anl.gov" target="_blank">rupp@mcs.anl.gov</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div id=":1q">Still, one may 'fake' a std::deque by holding meta-information about the physical memory pages nevertheless, yet allocate a (virtually) linear piece of memory to keep compatibility with XYZGetArray(). This would allow some nice optimizations for the threading scheduler, as threads may operate on 'nearer' pages first.</div>

</blockquote><div><br></div><div>So we sort of have this in the affinity associated with the thread communicator. We expect first-touch semantics (I wish we could use Linux's libnuma to be more explicit) so the page mapping can be computed using the page size and affinity.</div>
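<div><br></div><div>A minimal sketch of that page-mapping computation, in plain C rather than PETSc code. It assumes a page-aligned array, a contiguous block partition of the entries among threads, and that a page belongs to the thread owning its first entry. All function names here are hypothetical, and a page that straddles a partition boundary may in practice be touched first by either neighbor.</div>

```c
#include <assert.h>
#include <stddef.h>

/* First entry of thread t's block when n entries are split contiguously
 * among nthreads threads (hypothetical helper, not a PETSc routine). */
static size_t block_start(size_t n, size_t nthreads, size_t t)
{
  size_t base = n / nthreads, rem = n % nthreads;
  return base * t + (t < rem ? t : rem);
}

/* Thread whose block contains entry i. */
static size_t entry_owner(size_t n, size_t nthreads, size_t i)
{
  size_t t = 0;
  assert(nthreads > 0);
  while (t + 1 < nthreads && i >= block_start(n, nthreads, t + 1)) t++;
  return t;
}

/* Thread assumed to have first touched the given page: the owner of the
 * page's first entry. Assumes the array starts on a page boundary. */
static size_t page_owner(size_t page, size_t page_size, size_t elem_size,
                         size_t n, size_t nthreads)
{
  assert(elem_size > 0 && page_size >= elem_size);
  return entry_owner(n, nthreads, page * (page_size / elem_size));
}
```

<div>With 1024 doubles, 4096-byte pages, and two threads, this attributes page 0 to thread 0 and page 1 to thread 1.</div>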

<div><br></div><div>In my opinion, the critical issue of separate versus unified memory arises in kernels like MatMult. If we partition the rows of the matrix among threads (I don't want to assume a contiguous partition because I want to share cache, but this argument works anyway), we generally need access to parts of the vector that are not in our local segment. In general, we may need to access some entries in all segments of the vector. If we are not going to expose vectors via contiguous storage, we will be obligated to partition our matrices into explicit column segments, add a level of indirection to vector access, or explicitly copy the ghost parts into thread-local buffers.</div>
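<div><br></div><div>To make the contiguous-storage case concrete, here is a minimal sketch of a CSR matrix-vector product in which each thread would handle a range of rows while reading the shared input vector directly. It is plain C, not the PETSc MatMult implementation. The function name and argument layout are assumptions for illustration, and a contiguous row range is used only to keep the sketch short.</div>

```c
#include <assert.h>
#include <stddef.h>

/* Compute y[i] = sum_k val[k] * x[colidx[k]] for rows rstart <= i < rend of
 * a CSR matrix. Each thread would call this with its own row range. The
 * vector x is a single contiguous array in the common address space, so
 * off-segment ("ghost") entries are read in place, with no column
 * partitioning, no extra indirection, and no copy into a local buffer. */
static void csr_matmult_rows(size_t rstart, size_t rend,
                             const size_t *rowptr, const size_t *colidx,
                             const double *val, const double *x, double *y)
{
  assert(rstart <= rend);
  for (size_t i = rstart; i < rend; i++) {
    double sum = 0.0;
    for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++) {
      sum += val[k] * x[colidx[k]];
    }
    y[i] = sum; /* rows are disjoint across threads, so writes do not race */
  }
}
```

<div>Two threads owning rows [0,2) and [2,3) of a 3x3 matrix would each make one such call with the same x and y pointers.</div>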

<div><br></div><div>I don't like any of these options because of implementation complexity and because the ghost regions become much larger than the interiors as we strong scale. Note that the principal reason for using threads in the first place was to avoid needing these copies. If we were going to always copy into local buffers, we could use an MPI-type model all the way down. The defining feature of threads is that they can use a common address space.</div>

<div><br></div><div>MatMultTranspose is somewhat more complicated since the output space is overlapping.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div id=":1q"><div class="im"></div>

Btw: Shri's threading communicator is almost identical to the OpenCL model (with the latter having a few additional capabilities).<div class="im"><br></div></div></blockquote><div><br></div><div>Can you elaborate on this? It is still a good time to refine our threadcomm model.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":1q"><div class="im"></div>

A bit of a spoiler for the actual job runtime (more brainstorming than complete suggestions):<br>

I can imagine submitting the Vec, the IS, and the type of job to the scheduler, possibly including some hints on the type of operations to follow. One may even enforce a certain type of device here, even though this requires the scheduler to move the data in place first. In this way one can perform smaller tasks on the respective CPU core (if we keep track of affinity information), and offload larger tasks to an available accelerator if possible. (Note that this is the main reason why I don't want to hide buffers in library-specific derived classes of Vec). The scheduler can use simple heuristics on where to perform operations based on typical latencies (e.g. ~20us for a GPU kernel)</div>

</blockquote><div><br></div><div>Where does the scheduler go in this approach? Some users have a TS with a small number of degrees of freedom (e.g. reduced chemistry), so we want to keep "nothread" kernel launch down in the range similar to an indirect function call.</div>

<div><br></div><div>I think it would be good for the GPU subtype to have a threshold and fall back to a CPU implementation for small enough sizes, but I'm reluctant to have a scheduler looking at all Vec implementations.</div>
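<div><br></div><div>A minimal sketch of such a size threshold, kept inside one vector subtype rather than in a global scheduler. This is plain C with hypothetical names (AxpyOps, axpy_select). The "accelerator" routine is a host-side placeholder so the sketch stays self-contained, and the selection itself costs one comparison plus an indirect call.</div>

```c
#include <assert.h>
#include <stddef.h>

typedef void (*AxpyFn)(size_t n, double a, const double *x, double *y);

/* Hypothetical per-subtype dispatch table. */
typedef struct {
  AxpyFn cpu;       /* always available */
  AxpyFn accel;     /* may be NULL if no accelerator is present */
  size_t threshold; /* sizes below this stay on the CPU */
} AxpyOps;

/* Plain CPU implementation: y <- y + a*x. */
static void axpy_cpu(size_t n, double a, const double *x, double *y)
{
  for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* Placeholder for an accelerator kernel launch. It runs on the host here;
 * a real version would enqueue device work instead. */
static void axpy_accel_stub(size_t n, double a, const double *x, double *y)
{
  axpy_cpu(n, a, x, y);
}

/* Pick the implementation for a vector of length n. */
static AxpyFn axpy_select(const AxpyOps *ops, size_t n)
{
  assert(ops->cpu != NULL);
  return (ops->accel == NULL || n < ops->threshold) ? ops->cpu : ops->accel;
}
```

<div>A sensible threshold would have to be measured on the target device. The sketch only shows where the decision lives.</div>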

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":1q"><div class="im"></div>

Yes, we could/can. A single kernel launcher also allows for fusing kernels, e.g. matrix-vector-product followed by an inner product of the result vector with some other vector. As outlined above, asynchronous data movement could even be the default rather than the exception, except for cases where one gives control over the data to the outside by e.g. returning a pointer to the array. In such cases one would have to first wait for all operations on all data to finish.<br>


<br>

The main concern in all that is the readiness of the user. Awareness for asynchronous operations keeps rising, yet I can imagine user code like<br>

<br>

 PetscScalar * data = VecXYZGetArray(v1); // flushes the queue suitably<br>

 data[0] = VecDot(v2, v3);                // enqueues VecDot<br>

 PetscScalar s = data[0];                 // VecDot may not be finished!<br>

<br>

where a pointer given away once undermines everything.</div></blockquote></div><br><div>This is why the VecDot kernel does not return a raw result. You can call VecDot_kernel() collectively from another kernel, but you pass in a PetscThreadReduction and only reduce it when you need the result.</div>

<div><br></div><div>If a non-kernel caller wants an asynchronous interface, they would use VecDotBegin() and VecDotEnd(). Note that several reductions (dot products, norms, etc.) can be queued up with VecXXBegin() calls, yet only one reduction (optionally also asynchronous) is performed.</div>
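<div><br></div><div>The queuing idea can be illustrated with a toy model in plain C. This is not the PETSc implementation: ReductionQueue, dot_begin(), and dot_end() are hypothetical names, and the "parts" stand in for threads or processes. Each begin call stores only local partial sums, and the first end call combines every pending reduction in a single pass.</div>

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8
#define MAX_PARTS   4

/* Toy queue of pending dot products (hypothetical, not a PETSc type). */
typedef struct {
  double partial[MAX_PENDING][MAX_PARTS]; /* local contribution of each part */
  double result[MAX_PENDING];             /* valid after the combine pass */
  int    npending;
  int    nparts;
  int    npasses;  /* how many combine passes have run */
  int    reduced;  /* nonzero if result[] is up to date */
} ReductionQueue;

static void queue_init(ReductionQueue *q, int nparts)
{
  assert(nparts > 0 && nparts <= MAX_PARTS);
  q->npending = 0;
  q->nparts   = nparts;
  q->npasses  = 0;
  q->reduced  = 0;
}

/* Queue a dot product: compute per-part partial sums only. Returns a slot. */
static int dot_begin(ReductionQueue *q, const double *x, const double *y, size_t n)
{
  assert(q->npending < MAX_PENDING);
  int slot = q->npending++;
  for (int p = 0; p < q->nparts; p++) {
    size_t lo = n * (size_t)p / (size_t)q->nparts;
    size_t hi = n * (size_t)(p + 1) / (size_t)q->nparts;
    double s = 0.0;
    for (size_t i = lo; i < hi; i++) s += x[i] * y[i];
    q->partial[slot][p] = s;
  }
  q->reduced = 0; /* a new entry invalidates earlier combined results */
  return slot;
}

/* Fetch a result. The first call after queuing combines all pending slots. */
static double dot_end(ReductionQueue *q, int slot)
{
  assert(slot >= 0 && slot < q->npending);
  if (!q->reduced) {
    for (int k = 0; k < q->npending; k++) {
      q->result[k] = 0.0;
      for (int p = 0; p < q->nparts; p++) q->result[k] += q->partial[k][p];
    }
    q->npasses++;
    q->reduced = 1;
  }
  return q->result[slot];
}
```

<div>Queuing two dot products and then fetching both results triggers one combine pass in this model, which mirrors the single reduction described above.</div>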