[petsc-dev] Supporting OpenCL matrix assembly

Tue Sep 24 03:48:17 CDT 2013

Hey,

> Perhaps that *GetSource method should also return an opaque device "Mat"
> pointer that the user is responsible for shepherding into the kernel
>  From which they call the device MatSetValues?

This is easy if the OpenCL management is within PETSc (i.e. context, 
buffers, and command queues managed by us). I expect that a bunch of 
users want to provide their own context and so on, which would require 
us to offer something like
   MatAttachOpenCLEnvironment(Mat,cl_context,cl_command_queue);
for all the matrix and vector objects involved. Note that this needs to 
be attached before the matrix is created. I think this is doable.

>> b)
>> Other than that, I'm not sure whether I understand the semantics of the
>> proposed function correctly. In order for MatOpenCLGetSetValuesSource()
>> to be callable by device threads,
>
> The *GetSource method would be called from the CPU and would return a
> string containing the implementation of a type-specialized MatSetValues
> implementation.  The user would prepend its source to the string they
> pass to the OpenCL compiler.  Their own part of that string would
> contain code that calls MatSetValues (perhaps with a name that makes it
> clear that it's running on the device).

Ok, that makes more sense. :-)
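
To check that I understand it, here is a minimal host-side sketch of the 
string handling only. Everything named here is an assumption for 
illustration: mat_device_setvalues_source() stands in for whatever the 
*GetSource call would return, and MatSetValues_Device is a made-up name 
for the device-side routine. No OpenCL calls are made.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the string the proposed *GetSource call would
   return. The device function name and signature are assumptions. */
const char *mat_device_setvalues_source(void)
{
  return
    "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
    "void MatSetValues_Device(__global double *vals, int slot, double v)\n"
    "{\n"
    "  vals[slot] += v;\n"
    "}\n";
}

/* User-written kernel that calls the device-side routine. */
const char user_kernel_source[] =
  "__kernel void assemble(__global double *vals,\n"
  "                       __global const int *slots,\n"
  "                       __global const double *contrib, int n)\n"
  "{\n"
  "  int i = get_global_id(0);\n"
  "  if (i < n) MatSetValues_Device(vals, slots[i], contrib[i]);\n"
  "}\n";

/* Returns a newly allocated string with the library source first and the
   user source second, so the device routine is defined before it is used.
   The caller frees the result. Returns NULL if allocation fails. */
char *build_program_source(const char *lib_src, const char *user_src)
{
  size_t nlib, nuser;
  char  *out;

  assert(lib_src != NULL && user_src != NULL);
  nlib  = strlen(lib_src);
  nuser = strlen(user_src);
  out   = malloc(nlib + nuser + 1);
  if (!out) return NULL;
  memcpy(out, lib_src, nlib);
  memcpy(out + nlib, user_src, nuser + 1); /* copies the terminating '\0' too */
  return out;
}
```

The combined string is what would go to clCreateProgramWithSource(). 
Since that call accepts an array of strings, passing the two pieces 
separately without concatenation should work as well.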

> Suppose the column indices have been set in advance.  Now if the
> application already has a way of preventing conflicted cross-threadblock
> writes to those slots within an insertion round (e.g., coloring), PETSc
> would not need any synchronization and wouldn't need to stash
> possibly-conflicted writes elsewhere.  Otherwise, PETSc would have to
> manage the stashing, use atomics, or some other scheme.

I see. My experience is that synchronizations, particularly atomics, are 
usually too expensive on GPUs if one wants to compete with an optimized 
CPU implementation. Coloring is often reasonable, but the price to pay 
is poor strong scaling, because each color induces a ~10us kernel launch 
overhead. Either way, that's a reasonable implementation approach to 
start with.
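
To illustrate the coloring cost, here is a small CPU-only sketch under 
simplifying assumptions: a 1D mesh where element e contributes to nodes e 
and e+1, a trivial round-robin coloring, and the ~10us figure from above 
as a fixed per-launch cost. The function names are made up for this 
example; each call to assemble_one_color() stands in for one kernel 
launch.

```c
#include <assert.h>

#define LAUNCH_OVERHEAD_US 10.0 /* rough per-launch figure, an assumption */

/* Stand-in for one kernel launch: processes all elements of one color.
   Elements of the same color are ncolors apart, so for ncolors >= 2 no two
   of them touch the same node and the iterations could run concurrently
   without atomics. node_vals must have nelem + 1 entries. */
void assemble_one_color(double *node_vals, int nelem, int color, int ncolors)
{
  int e;
  for (e = color; e < nelem; e += ncolors) {
    node_vals[e]     += 1.0;
    node_vals[e + 1] += 1.0;
  }
}

/* Runs one "launch" per color and returns the number of launches. */
int assemble_colored(double *node_vals, int nelem, int ncolors)
{
  int c, launches = 0;

  assert(ncolors >= 2); /* with one color, neighbors would conflict */
  for (c = 0; c < ncolors; ++c) {
    assemble_one_color(node_vals, nelem, c, ncolors);
    ++launches;
  }
  return launches;
}

/* Fixed cost that does not shrink with the per-process problem size. */
double launch_overhead_us(int launches)
{
  return launches * LAUNCH_OVERHEAD_US;
}
```

The overhead term depends only on the number of colors, not on the 
number of elements, so it becomes the dominant cost once the local 
problem per process gets small.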

Best regards,
Karli