<div dir="ltr"><div class="gmail_extra">I'm interested in seeing this too, especially if somebody can explain the results after they've been demonstrated :)</div><div class="gmail_extra"><br></div><div class="gmail_extra">

A<br><br><div class="gmail_quote">On Tue, Apr 24, 2012 at 10:42 PM, Jed Brown <span dir="ltr">&lt;<a href="mailto:jedbrown@mcs.anl.gov" target="_blank">jedbrown@mcs.anl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im"><div class="gmail_quote">On Tue, Apr 24, 2012 at 14:29, Daniel Lowell <span dir="ltr">&lt;<a href="mailto:redratio1@gmail.com" target="_blank">redratio1@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Launching smaller, overlapping asynchronous kernels can give a speedup if your vectors are large and you are doing reductions. This way warp stalls can be compensated for and latencies can be hidden. Not sure what you mean by "the way it currently is", though...</blockquote>


</div><br></div><div>The reduction is only needed at the end. Any sequential launch adds artificial synchronization. I'd be interested to see the performance comparison, but I'd be surprised if independent kernel launches were faster than a decent implementation with one kernel launch.</div>


</blockquote></div><br></div></div>