<div class="gmail_quote">On Tue, Apr 24, 2012 at 14:12, Daniel Lowell <span dir="ltr">&lt;<a href="mailto:redratio1@gmail.com">redratio1@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I'm writing a vector type which uses flag syncing like you have in PETSc with Vec CUSP; however, it uses asynchronous kernel launches (pipelining, etc.) and autotuned kernels. It's not quite ready for prime time, but we have seen the value of it in terms of speedup.</blockquote>

</div><br><div>Okay, but why launch dozens of small kernels when all the data is available up front? I'm just skeptical that VecMDot should be implemented for CUDA the way it currently is.</div>