[petsc-dev] SNES ex19 not using GPU despite passing the options

Tue Jan 14 16:49:59 CST 2014

Hi Mani,

> Thanks for the reply. That fixed it. I get only a 10% speed up using the
> cusp options. Is the residual evaluation at each iteration happening on
> the CPU or the GPU?

The residual evaluation happens on the CPU unless a dedicated GPU
kernel is provided for it, which is not the case in ex19.

> Is there any way one can do the residual evaluation
> on the GPU too, after the data has been transferred?

Technically it is possible by extracting the underlying GPU buffers from
the vector objects and managing the Field data manually. Frankly, I
don't know the current state of the local-to-global mappings; you will
likely have to do quite a bit of copying between host and device by
hand.

> Ex42 shows how it
> can be done using cusp but it looks really ugly and I want to use
> OpenCL. Basically can I do something like this?
>
> DMGetLocalVector(da, &localX); //Vector is now in GPU.
> DMDAVecGetArray(da, localX, &x); //Array is on GPU.
>
> //Create buffers for OpenCL
> buffer = cl::Buffer(context, CL_MEM_USE_HOST_PTR |
>                                                  CL_MEM_READ_WRITE,
>                                    sizeofarray, &x[X2Start-Ng][X1Start-Ng]
>                                     , &clErr);
>
> (I'm hoping that here CL_MEM_USE_HOST_PTR will give a pointer to the
> data already on the GPU)
>
> // Launch OpenCL kernels and now map the buffers to read off the data.
>
> DMDAVecRestoreArray(da, localX, &x);
> DMRestoreLocalVector(da, &localX);
>
> I think the question is whether DMDAVecGetArray will return a pointer to
> the data on the GPU or not.

*VecGetArray() will always return a plain C pointer, because C has no
function overloading and the return type therefore cannot change with
the backend. Buffers in OpenCL are of type cl_mem, so this won't work.
Also, you won't be able to copy a two-dimensional array with just one
pointer &x[][]. As far as I know, we don't have any API which provides
GPU buffers directly, but maybe Matt recently added some functions for
this to work with FEM.
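
To illustrate the point about two-dimensional arrays, here is a minimal
sketch in plain C (no PETSc or OpenCL calls) of packing data that is
addressed through row pointers into one contiguous staging buffer, which
is the form a single-pointer buffer API expects. The names pack_rows and
unpack_rows are hypothetical helpers, not PETSc functions, and the
sketch assumes nothing about how a particular DMDA lays out its memory.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PETSc): copy an ny-by-nx block that
 * is addressed through row pointers into one contiguous, row-major
 * buffer. The rows themselves need not be adjacent in memory. */
static void pack_rows(double *const *rows, size_t ny, size_t nx,
                      double *flat)
{
  for (size_t j = 0; j < ny; ++j)
    for (size_t i = 0; i < nx; ++i)
      flat[j * nx + i] = rows[j][i];
}

/* Hypothetical inverse: scatter the contiguous buffer back into the
 * row-pointer layout, e.g. after results have been copied back from
 * a device. */
static void unpack_rows(const double *flat, size_t ny, size_t nx,
                        double *const *rows)
{
  for (size_t j = 0; j < ny; ++j)
    for (size_t i = 0; i < nx; ++i)
      rows[j][i] = flat[j * nx + i];
}
```

The flat buffer could then be handed to whatever host-to-device copy the
GPU API provides, and unpack_rows would bring the results back into the
row-pointer layout once the kernel has finished.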

As far as I can tell, providing only the kernel won't suffice because we
don't have GPU implementations for 'Field' data available. Hence, you
would have to copy the x and b arrays manually and then copy everything
back, which is most likely too much of a performance hit to be worth
the effort. Since GPUs are getting more and more integrated into CPUs,
it's questionable whether it's worth the time to implement such
additional memory management for accelerators if they disappear in
their discrete PCI-Express form a few years from now.

Best regards,
Karli