<br><div class="gmail_extra">On Sun, Nov 4, 2012 at 9:51 PM, Karl Rupp <span dir="ltr">&lt;<a href="mailto:rupp@mcs.anl.gov" target="_blank">rupp@mcs.anl.gov</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Matt,<div class="im"><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Very cool. I did not expect the large AMD overheads, but the results<br>

below make the current CUDA strategy look<br>

pretty good, and an OpenCL strategy look fine for large problems.<br>

<br>

I have a more basic question. I saw the previous implementation in very<br>

simple terms:<br>

<br>

   1) Replication of CPU vectors on the GPU<br>

<br>

   2) A coherence policy<br>

<br>

The nice thing here is how robust it is. We really do not have to commit<br>

to any implementation because<br>

the CPU part can always pick up the slack. From what is written below, I<br>

cannot understand how the<br>

"coherence policy" works.<br>

<br>

Let's use an example to explain it to me. Say that you have a CUDA<br>

vector, but you want to execute<br>

VecPointwiseMult() with another Vec, but that operation is not part of<br>

your CUDA implementation.<br>

What happens?<br>

<br>

</blockquote>

<br></div>

It works in the same way as it does now:<br>

<br>

  if (memory_handle.active_handle != main_memory_flag) {<br>

    copy_to_cpu();<br>

    memory_handle.active_handle = main_memory_flag;<br>

  }<br>

  process_on_cpu();<br>

<br>

If each of the memory regions carries a 'valid-flag', the flag for the CUDA part is set to invalid after the processing.<br></blockquote><div><br></div><div>Great. Then I understand much more.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
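To make the valid-flag variant concrete, here is a minimal sketch in C. It is only an illustration under stated assumptions: all names (vec_t, sync_to_cpu, pointwise_mult_cpu, set_on_device) are hypothetical, and the "device" buffer is plain host memory standing in for a real accelerator allocation, so this is not the actual implementation.<br>
<pre>
```c
/* All names here are hypothetical. The "device" array is ordinary host
   memory standing in for an accelerator buffer, so this runs anywhere. */
enum { N = 4 };

typedef struct {
  double host[N];       /* CPU copy */
  double device[N];     /* stand-in for the GPU copy */
  int    host_valid;    /* nonzero if host[] holds current data */
  int    device_valid;  /* nonzero if device[] holds current data */
  int    copies_to_cpu; /* transfer counter, for illustration only */
} vec_t;

/* Put fresh data on the "device" only; the host copy becomes stale. */
static void set_on_device(vec_t *v, const double *vals) {
  int i;
  for (i = 0; i != N; ++i) v->device[i] = vals[i];
  v->device_valid = 1;
  v->host_valid = 0;
}

/* Make the host copy current, transferring only if it is stale. */
static void sync_to_cpu(vec_t *v) {
  int i;
  if (v->host_valid) return;
  /* This loop stands in for a device-to-host copy. */
  for (i = 0; i != N; ++i) v->host[i] = v->device[i];
  v->host_valid = 1;
  v->copies_to_cpu++;
}

/* CPU fallback for w = x .* y when no accelerator kernel is available. */
static void pointwise_mult_cpu(vec_t *w, vec_t *x, vec_t *y) {
  int i;
  sync_to_cpu(x);
  sync_to_cpu(y);
  for (i = 0; i != N; ++i) w->host[i] = x->host[i] * y->host[i];
  w->host_valid = 1;
  w->device_valid = 0; /* the result now lives on the CPU only */
}
```
</pre>
In this sketch, calling pointwise_mult_cpu() a second time on the same inputs triggers no further transfers, because the host copies of the inputs stay valid after the first call.<br>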


The copy_to_cpu part could be managed via page-locked memory, but I still have to investigate its robustness. It is certainly a nice option for APUs, because it has zero overhead. The only drawback is that APUs just lack performance in general...<br>


<br>

Overall, I don't want to give up the robustness you described above. There will always be some operations that work better on the CPU, while others work better on accelerators, so hopping between them is (unfortunately) the rule rather than the exception in real-world applications.<br>

</blockquote><div><br></div><div>Okay, then I have a procedural question. Should we write about this for the GPU book, or just describe the</div><div>current state of affairs?</div><div><br></div><div>  Thanks,</div><div>

<br></div><div>      Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Best regards,<br>

Karli<br>

<br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener<br>

</div>