[petsc-dev] Current status: GPUs for PETSc

Karl Rupp rupp at mcs.anl.gov
Sun Nov 4 20:51:30 CST 2012

Hi Matt,

> Very cool. I did not expect the large AMD overheads, but the results
> below make the current CUDA strategy look
> pretty good, and an OpenCL strategy look fine for large problems.
> I have a more basic question. I saw the previous implementation in very
> simple terms:
>    1) Replication of CPU vectors on the GPU
>    2) A coherence policy
> The nice thing here is how robust it is. We really do not have to commit
> to any implementation because
> the CPU part can always pick up the slack. From what is written below, I
> cannot understand how the
> "coherence policy" works.
> Lets use an example to explain it to me. Say that you have a CUDA
> vector, but you want to execute
> VecPointwiseMult() with another Vec, but that operation is not part of
> your CUDA implementation.
> What happens?

It works in the same way as it does now: if the data is not valid in 
main memory, it is first copied back from the GPU, and then the plain 
CPU implementation runs:

   if (memory_handle.active_handle != main_memory_flag) {
     copy_to_cpu(memory_handle);
     memory_handle.active_handle = main_memory_flag;
   }
   /* ... now run the CPU version of VecPointwiseMult() ... */

If each of the memory regions carries a 'valid-flag', the flag for the 
CUDA part is set to invalid after the CPU has modified the data.
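To make the round trip concrete, here is a minimal self-contained C sketch of such a coherence policy. The names (`Vec` with a `valid` bitmask, `ensure_cpu_valid`, a plain host buffer standing in for device memory) are illustrative stand-ins, not the actual PETSc types; in reality the copy would be a cudaMemcpy.

```c
#include <stdlib.h>
#include <string.h>

enum { VALID_CPU = 1, VALID_GPU = 2 };  /* which copies are up to date */

typedef struct {
  double *cpu;     /* main-memory copy */
  double *gpu;     /* device copy (a plain buffer here, for illustration) */
  size_t  n;
  int     valid;   /* bitmask of VALID_CPU / VALID_GPU */
} Vec;

/* Copy device -> host only if the main-memory copy is stale. */
static void ensure_cpu_valid(Vec *v) {
  if (!(v->valid & VALID_CPU)) {
    memcpy(v->cpu, v->gpu, v->n * sizeof(double)); /* cudaMemcpy in reality */
    v->valid |= VALID_CPU;
  }
}

/* CPU fallback for an operation that has no GPU implementation.
 * After it runs, only the main-memory copy of w is valid. */
static void vec_pointwise_mult_cpu(Vec *w, Vec *x, Vec *y) {
  ensure_cpu_valid(x);
  ensure_cpu_valid(y);
  for (size_t i = 0; i < w->n; ++i) w->cpu[i] = x->cpu[i] * y->cpu[i];
  w->valid = VALID_CPU;  /* any GPU copy of w is now stale */
}
```

The point is that every operation can fall back to the CPU path: the worst case is one extra transfer, after which the valid-flags keep further transfers from happening until the GPU actually touches the data again.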

The copy_to_cpu-part could be managed via page-locked memory, yet I 
still have to investigate its robustness. It is certainly a nice option 
for APUs, because it has zero overhead. The only drawback is that APUs 
just lack performance in general...

Overall, I don't want to give up the robustness you described above. 
There will always be some operations that work better on the CPU, while 
others work better on accelerators, so hopping between them is 
(unfortunately) the rule rather than the exception in real-world 
applications.

Best regards,
