<div class="gmail_quote">On Tue, Apr 24, 2012 at 14:29, Daniel Lowell <span dir="ltr">&lt;<a href="mailto:redratio1@gmail.com">redratio1@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Launching smaller, overlapping asynchronous kernels can yield a speedup if your vectors are large and you are doing reductions. This way warp stalls can be compensated for and latencies can be hidden. Not sure what you mean by "the way it currently is" though...</blockquote>

</div><br><div>The reduction is only needed at the end, so every sequential launch before that point adds artificial synchronization. I'd be interested to see a performance comparison, but I'd be surprised if independent kernel launches were faster than a decent implementation with a single kernel launch.</div>
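<br><div>For illustration only, here is a minimal sketch of what a single-launch implementation could look like, using a dot product as a stand-in for the actual operation. The kernel name <code>dot_single_launch</code>, the launch configuration, and the test data are all assumptions made for this example, not code from the thread. Each thread accumulates privately, each block reduces in shared memory, and the cross-block reduction happens once at the end via <code>atomicAdd</code> (floating-point <code>atomicAdd</code> needs compute capability 2.0 or newer).</div>
<pre>
#include &lt;cstdio&gt;
#include &lt;vector&gt;
#include &lt;cuda_runtime.h&gt;

// Hypothetical fused kernel: one launch covers the whole vector.
// Threads accumulate private partial sums over a grid-stride loop,
// so no synchronization is needed until the final reduction.
__global__ void dot_single_launch(const float *x, const float *y,
                                  float *out, int n)
{
    extern __shared__ float partial[];
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i &lt; n;
         i += gridDim.x * blockDim.x)
        sum += x[i] * y[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction within the block (assumes blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s &gt; 0; s &gt;&gt;= 1) {
        if (threadIdx.x &lt; s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic per block folds the block result into the global sum.
    if (threadIdx.x == 0)
        atomicAdd(out, partial[0]);
}

int main()
{
    const int n = 1 &lt;&lt; 20;              // assumed vector length
    std::vector&lt;float&gt; h(n, 1.0f);      // assumed test data: all ones
    float *x, *y, *out, result = 0.0f;
    cudaMalloc(&amp;x, n * sizeof(float));
    cudaMalloc(&amp;y, n * sizeof(float));
    cudaMalloc(&amp;out, sizeof(float));
    cudaMemcpy(x, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(out, 0, sizeof(float));

    // Assumed launch configuration; tune for the actual device.
    const int threads = 256, blocks = 64;
    dot_single_launch&lt;&lt;&lt;blocks, threads, threads * sizeof(float)&gt;&gt;&gt;(x, y, out, n);

    // This blocking copy waits for the kernel to finish before reading.
    cudaMemcpy(&amp;result, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("dot = %f\n", result);

    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
</pre>
<div>Built with <code>nvcc</code>, this gives one side of the comparison; the multi-launch variant would split the vector into chunks and issue one kernel per chunk on separate streams. Error checking is omitted to keep the sketch short.</div>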