On Sat, Oct 6, 2012 at 8:51 AM, Karl Rupp <rupp@mcs.anl.gov> wrote:
<div id=":5gw">Hmm, I thought that spptr is some 'special pointer' as commented in Mat, but not supposed to be a generic pointer to a derived class' datastructure (spptr is only injected with #define PETSC_HAVE_CUSP).</div>
Look at cuspvecimpl.h, for example.

#undef __FUNCT__
#define __FUNCT__ "VecCUSPGetArrayReadWrite"
PETSC_STATIC_INLINE PetscErrorCode VecCUSPGetArrayReadWrite(Vec v, CUSPARRAY** a)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  *a   = 0;
  ierr = VecCUSPCopyToGPU(v);CHKERRQ(ierr);
  *a   = ((Vec_CUSP *)v->spptr)->GPUarray;
  PetscFunctionReturn(0);
}

Vec is following the convention from Mat, where spptr points to the Mat_UMFPACK, Mat_SuperLU, Mat_SeqBSTRM, etc., which hold the extra information needed for that derived class (note that "derivation" for Mat is typically done at run time, after the object has been created, due to a call to MatGetFactor).
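For reference, the struct hanging off spptr in the CUSP case is little more than a holder for the device array; roughly, and from memory rather than the exact header:

typedef struct {
  CUSPARRAY *GPUarray;   /* device-side storage, accessed above via ((Vec_CUSP*)v->spptr)->GPUarray */
} Vec_CUSP;

so "deriving" a class amounts to attaching such a struct to spptr and overriding entries in the ops table.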
>> With different devices, we could simply have a valid flag for each
>> device. When someone does VecDevice1GetArrayWrite(), the flag for all
>> other devices is marked invalid. When VecDevice2GetArrayRead() is
>> called, the implementation copies from any valid device to device2.
>> Packing all those flags as bits in a single int is perhaps convenient,
>> but not necessary.
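A minimal sketch of that bookkeeping, with hypothetical names rather than existing PETSc API: one validity flag per device plus one for the host, cleared by write access and consulted by read access.

#define VEC_MAX_DEVICES 4

typedef struct {
  PetscBool host_valid;                    /* is the CPU copy current? */
  PetscBool device_valid[VEC_MAX_DEVICES]; /* is the copy on device i current? */
} VecValidFlags;

/* VecDeviceNGetArrayWrite(): writing on one device invalidates all other copies */
static void VecFlagsMarkWrite(VecValidFlags *f,PetscInt dev)
{
  PetscInt i;
  f->host_valid = PETSC_FALSE;
  for (i=0; i<VEC_MAX_DEVICES; i++) f->device_valid[i] = (i == dev) ? PETSC_TRUE : PETSC_FALSE;
}

/* VecDeviceNGetArrayRead(): if this device is stale, copy from the host or any valid device */
static PetscErrorCode VecFlagsEnsureRead(VecValidFlags *f,PetscInt dev)
{
  PetscFunctionBegin;
  if (!f->device_valid[dev]) {
    /* ... find a valid source and copy its data to device 'dev' ... */
    f->device_valid[dev] = PETSC_TRUE;
  }
  PetscFunctionReturn(0);
}

Whether these flags are packed into an int bitfield or kept as a small array is indeed just a detail.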
> I think that the most common way of handling GPUs will be an overlapping decomposition of the host array, similar to how a vector is distributed via MPI (locally owned, writeable values vs. ghost values that are read-only). Assigning the full vector exclusively to just one device is more of a single-GPU scenario than a multi-GPU use case.
Okay, the matrix will have to partition itself. What is the advantage of having a single CPU process addressing multiple GPUs? Why not use different MPI processes? (We can have the MPI processes sharing a node create a subcomm so they can decide which process is driving which device.)
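That split is cheap to set up; a sketch of the one-rank-per-device assignment, assuming an MPI-3 implementation (for MPI_Comm_split_type) and the CUDA runtime, neither of which is PETSc-specific:

#include <mpi.h>
#include <cuda_runtime.h>

/* Group the ranks that share a node, then let each one drive a different device. */
static int AssignDeviceToRank(MPI_Comm comm)
{
  MPI_Comm nodecomm;
  int      noderank,ndevices;

  MPI_Comm_split_type(comm,MPI_COMM_TYPE_SHARED,0,MPI_INFO_NULL,&nodecomm);
  MPI_Comm_rank(nodecomm,&noderank);
  cudaGetDeviceCount(&ndevices);
  cudaSetDevice(noderank % ndevices);   /* round-robin if ranks outnumber devices */
  MPI_Comm_free(&nodecomm);
  return noderank % ndevices;
}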
>> I think this stuff (which allows for segmenting the array on the device)
>> can go in Vec_CUDA and Vec_OpenCL, basically just replacing the GPUarray
>> member of Vec_CUSP. Why have a different PetscAcceleratorData struct?
>
> If spptr is intended to be a generic pointer to data of the derived class, then this is also a possibility. However, this would lead to Vec_CUDA, Vec_OpenCL, and Vec_CUDA_OpenCL, with the number of implementations rapidly increasing as one may eventually add other frameworks. The PetscAcceleratorData would essentially allow for a unification of Vec_CUDA, Vec_OpenCL, and Vec_CUDA_OpenCL, avoiding code duplication problems.
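For concreteness, the unification described above could look roughly like this (hypothetical names; this is the proposal under discussion, not existing code):

typedef struct {
#if defined(PETSC_HAVE_CUDA)
  PetscCUDAHandleDescriptor   *cuda_handles;    /* see the descriptor sketch further down */
  PetscInt                     num_cuda_handles;
#endif
#if defined(PETSC_HAVE_OPENCL)
  PetscOpenCLHandleDescriptor *opencl_handles;
  PetscInt                     num_opencl_handles;
#endif
} PetscAcceleratorData;

/* spptr would point at one PetscAcceleratorData regardless of which
   backends are enabled, so no Vec_CUDA_OpenCL combination class is needed. */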
How would the user decide which device they want computation to run on? (Also, is OpenCL really the right name in an environment where there may be multiple devices using OpenCL?) Currently, the type indicates where native operations should "prefer" to compute, copying data there when necessary. The Vec operations have different implementations for CUDA and OpenCL, so I don't see the problem with making them different derived classes. If we wanted a hybrid CUDA/OpenCL class, it would contain the logic for deciding where to do things, followed by dispatch into the device-specific implementation, so it doesn't seem like duplication to me.
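To make the "dispatch, not duplication" point concrete, a hybrid implementation of a single operation might be no more than the following (VecAXPY_SeqCUDA, VecAXPY_SeqOpenCL, and VecHybridChooseTarget are placeholders, not existing PETSc symbols):

typedef enum {VEC_TARGET_CUDA,VEC_TARGET_OPENCL} VecExecTarget;

static PetscErrorCode VecAXPY_Hybrid(Vec y,PetscScalar alpha,Vec x)
{
  PetscErrorCode ierr;
  VecExecTarget  target;

  PetscFunctionBegin;
  /* policy lives here: data locality, device load, problem size, ... */
  ierr = VecHybridChooseTarget(y,x,&target);CHKERRQ(ierr);
  if (target == VEC_TARGET_CUDA) {
    ierr = VecAXPY_SeqCUDA(y,alpha,x);CHKERRQ(ierr);
  } else {
    ierr = VecAXPY_SeqOpenCL(y,alpha,x);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}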
>>> Here, the PetscXYZHandleDescriptor holds
>>>  - the memory handle,
>>>  - the device ID the handles are valid for, and
>>>  - a flag whether the data is valid
>>>    (cf. valid_GPU_array, but with a much finer granularity).
>>> Additional metainformation such as index ranges can be extended as
>>> needed, cf. Vec_Seq vs Vec_MPI. Different types of
>>> Petsc*HandleDescriptors are expected to be required because the
>>> various memory handle types are not guaranteed to have a particular
>>> maximum size among different accelerator platforms.
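Spelling that out for the CUDA flavor (again hypothetical names and layout, just following the description above):

typedef struct {
  CUdeviceptr handle;        /* device memory handle; the OpenCL variant would carry a cl_mem */
  PetscInt    device_id;     /* device this handle is valid for */
  PetscBool   valid;         /* is this copy current? (valid_GPU_array at finer granularity) */
  PetscInt    rstart,rend;   /* optional index range, cf. the Vec_Seq vs Vec_MPI metadata */
} PetscCUDAHandleDescriptor;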
>> It sounds like you want to support marking only part of an array as
>> stale. We could keep one top-level (_p_Vec) flag indicating
>> whether the CPU part was current, then in the specific implementation
>> (Vec_OpenCL), you can hold finer granularity. Then when
>> vec->ops->UpdateCPUArray() is called, you can look at the finer
>> granularity flags to copy only what needs to be copied.
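A sketch of that two-level scheme (again with made-up names, standing in for Vec_OpenCL/Vec_CUDA internals): the coarse _p_Vec flag says "the host may be stale", and the implementation-level segments say exactly which ranges to bring back when vec->ops->UpdateCPUArray() fires.

typedef struct {
  PetscInt  start,end;    /* index range covered by this device buffer */
  PetscBool host_stale;   /* host copy of this range is out of date */
} VecDeviceSegment;

typedef struct {
  VecDeviceSegment *segments;
  PetscInt          nsegments;
} Vec_Device;              /* stand-in for Vec_OpenCL / Vec_CUDA */

static PetscErrorCode VecUpdateCPUArray_Device(Vec v)
{
  Vec_Device     *d = (Vec_Device*)v->spptr;
  PetscErrorCode  ierr;
  PetscInt        i;

  PetscFunctionBegin;
  for (i=0; i<d->nsegments; i++) {
    if (d->segments[i].host_stale) {
      /* VecDeviceCopySegmentToHost() is a hypothetical helper: device -> host for [start,end) */
      ierr = VecDeviceCopySegmentToHost(v,d->segments[i].start,d->segments[i].end);CHKERRQ(ierr);
      d->segments[i].host_stale = PETSC_FALSE;
    }
  }
  PetscFunctionReturn(0);
}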
> Yes, I also thought of such a top-level flag. This is, however, rather an optimization flag (similar to what is done in VecGetArray for petscnative), so I refrained from a separate discussion.
>
> Aside from that, yes, I want to support marking parts of an array as stale, since the best multi-GPU use I've experienced so far is in block-based preconditioners (cf. Block-ILU variants, parallel AMG flavors, etc.). A multi-GPU sparse matrix-vector product is handled similarly.