[petsc-dev] Performance of VecMDot_SeqCUSP

Tue Apr 24 14:49:14 CDT 2012

A tree-wise parallel reduction on a single dot product does leave most of
the threads stalled by the end of the kernel's execution. I leave that
inefficiency in place and compensate for it by pipelining the dot
operation: I break it up into smaller kernels and launch them so that they
overlap. Obviously this only makes sense for large vectors. I haven't tried
implementing the entire MDot as one large kernel, though. It might be
worthwhile.
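
For illustration only, here is a minimal CPU-side sketch in C++ of what a
single-kernel MDot could look like. It is not PETSc or CUSP code. The name
fused_mdot, the "lanes" parameter (standing in for the threads of one block)
and the "active" output are all hypothetical. It assumes every vector in ys
is at least as long as x and that lanes is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of one GPU thread block; not PETSc or CUSP code.
// Computes dots[k] = sum_i x[i] * ys[k][i] for every k in one pass over x.
// `lanes` plays the role of the block size and must be a power of two.
// If `active` is non-null, it receives the number of busy lanes per tree step.
std::vector<double> fused_mdot(const std::vector<double>& x,
                               const std::vector<std::vector<double>>& ys,
                               std::size_t lanes,
                               std::vector<std::size_t>* active = nullptr) {
    assert(lanes > 0 && (lanes & (lanes - 1)) == 0);
    const std::size_t nv = ys.size();
    // partial[lane * nv + k] is the per-lane accumulator ("shared memory").
    std::vector<double> partial(lanes * nv, 0.0);

    // Phase 1: each lane strides through x once and updates all nv
    // accumulators, so x[i] is read once rather than once per dot product.
    for (std::size_t lane = 0; lane < lanes; ++lane)
        for (std::size_t i = lane; i < x.size(); i += lanes)
            for (std::size_t k = 0; k < nv; ++k)
                partial[lane * nv + k] += x[i] * ys[k][i];

    // Phase 2: a single tree reduction at the very end. Only `stride` lanes
    // do work at each step, so the busy-lane count halves every time.
    for (std::size_t stride = lanes / 2; stride > 0; stride /= 2) {
        if (active) active->push_back(stride);
        for (std::size_t lane = 0; lane < stride; ++lane)
            for (std::size_t k = 0; k < nv; ++k)
                partial[lane * nv + k] += partial[(lane + stride) * nv + k];
    }
    // Lane 0 now holds the nv finished dot products.
    return std::vector<double>(partial.begin(), partial.begin() + nv);
}
```

Called with 4 lanes, the tree phase keeps 2 lanes busy and then 1, which is
the stall pattern described above. The difference from separate launches is
that the tree phase runs once for all the dot products rather than once per
dot. Whether that beats overlapping smaller kernels on real hardware would
have to be measured.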

On Tue, Apr 24, 2012 at 2:42 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> On Tue, Apr 24, 2012 at 14:29, Daniel Lowell <redratio1 at gmail.com> wrote:
>
>> Launching smaller overlapping asynchronous kernels can give a speedup if
>> your vectors are large and you are doing reductions. This way warp stalls
>> can be compensated for, and latencies can be hidden. Not sure what you mean
>> by "the way it currently is" though...
>
>
> The reduction is only needed at the end. Any sequential launch adds
> artificial synchronization. I'd be interested to see the performance
> comparison, but I'd be surprised if independent kernel launches were faster
> than a decent implementation with one kernel launch.
>

-- 
Daniel Lowell