[petsc-dev] SNES ex19 not using GPU despite passing the options
Karl Rupp
rupp at mcs.anl.gov
Tue Jan 14 16:49:59 CST 2014
Hi Mani,
> Thanks for the reply. That fixed it. I get only a 10% speed up using the
> cusp options. Is the residual evaluation at each iteration happening on
> the CPU or the GPU?
The residual evaluation happens on the CPU unless a dedicated kernel is
provided for it (which is not the case in ex19).
> Is there anyway one can do the residual evaluation
> on the GPU too, after the data has been transferred?
Technically it is possible by extracting the underlying GPU buffers from
the vector objects and by manually managing the Field data. Frankly, I
don't know the current state of the local-to-global mappings, so you
will likely have to do quite a bit of copying between host and device
manually.
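To give a rough idea of what "manually managing the Field data" can mean,
here is a minimal sketch in plain C. It assumes a four-component node
struct modeled on the Field type in ex19 (the component names are my
assumption), uses double as a stand-in for PetscScalar, and the helper
names field_split/field_join are hypothetical, not PETSc API. It only
shows the host-side repacking of interleaved data into one array per
component, a layout a device kernel may prefer; no GPU call is involved.

```c
#include <stddef.h>

/* Stand-in for PetscScalar; an assumption for this sketch. */
typedef double scalar_t;

/* Hypothetical node layout modeled on ex19's Field struct. */
typedef struct {
    scalar_t u, v, omega, temp;
} Field;

/* Split n interleaved Field entries into one array per component.
 * Each output array must have room for n entries. */
void field_split(const Field *f, size_t n,
                 scalar_t *u, scalar_t *v, scalar_t *omega, scalar_t *temp)
{
    for (size_t k = 0; k < n; ++k) {
        u[k]     = f[k].u;
        v[k]     = f[k].v;
        omega[k] = f[k].omega;
        temp[k]  = f[k].temp;
    }
}

/* Inverse of field_split: write the component arrays back into the
 * interleaved Field array. */
void field_join(Field *f, size_t n,
                const scalar_t *u, const scalar_t *v,
                const scalar_t *omega, const scalar_t *temp)
{
    for (size_t k = 0; k < n; ++k) {
        f[k].u     = u[k];
        f[k].v     = v[k];
        f[k].omega = omega[k];
        f[k].temp  = temp[k];
    }
}
```

Every such split and join is an extra pass over the data on the host, on
top of the transfers themselves.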
> Ex42 shows how it
> can be done using cusp but it looks really ugly and I want to use
> OpenCL. Basically can I do something like this?
>
> DMGetLocalVector(da, &localX); //Vector is now in GPU.
> DMDAVecGetArray(da, localX, &x); //Array is on GPU.
>
> //Create buffers for OpenCL
> buffer = cl::Buffer(context, CL_MEM_USE_HOST_PTR |
> CL_MEM_READ_WRITE,
> sizeofarray, &x[X2Start-Ng][X1Start-Ng]
> , &clErr);
>
> (I'm hoping that here CL_MEM_USE_HOST_PTR will give a pointer to the
> data already on the GPU)
>
> // Launch OpenCL kernels and now map the buffers to read off the data.
>
> DMDAVecRestoreArray(da, localX, &x);
> DMRestoreLocalVector(da, &localX);
>
> I think the question is whether DMDAVecGetArray will return a pointer to
> the data on the GPU or not.
*VecGetArray() will always return a plain C pointer, because functions
cannot be overloaded in C. Buffers in OpenCL are of type cl_mem, so this
won't work. Also, you won't be able to copy a two-dimensional array with
just one pointer &x[][]. As far as I know, we don't have any API which
provides GPU buffers directly, but maybe Matt recently added some
functions for this to work with FEM.
As far as I can tell, providing only the kernel won't suffice because we
don't have GPU implementations for 'Field' data available. Hence, you
would have to copy the x and b arrays to the device manually and then
copy everything back, which is most likely too much of a performance hit
to be worth the effort. Since GPUs are getting more and more integrated
into CPUs, it's questionable whether it's worth the time to implement
such additional memory management for accelerators if they disappear in
their discrete PCI-Express form a few years from now...
Best regards,
Karli