[petsc-dev] Performance of VecMDot_SeqCUSP

Daniel Lowell redratio1 at gmail.com
Tue Apr 24 14:29:48 CDT 2012


Launching smaller, overlapping asynchronous kernels can give a speedup if
your vectors are large and you are doing reductions. This way warp stalls
can be compensated for and latencies can be hidden. I'm not sure what you
mean by "the way it currently is", though...
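
To make the two strategies concrete, here is a minimal CPU-side sketch in
C++. It is only an analogue: it is not CUDA and not the PETSc
VecMDot_SeqCUSP code. The names mdot_fused and mdot_chunked and the chunk
parameter are made up for illustration. Each (vector, chunk) pair in the
chunked version stands in for one small kernel launch that produces a
partial sum, and the fused version stands in for a single pass over x that
updates every dot product at once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only; not PETSc or CUSP API.

// Fused variant: a single sweep over x updates every dot product x . ys[j].
std::vector<double> mdot_fused(const std::vector<double>& x,
                               const std::vector<std::vector<double>>& ys) {
    for (const auto& y : ys) assert(y.size() == x.size());
    std::vector<double> out(ys.size(), 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t j = 0; j < ys.size(); ++j) {
            out[j] += x[i] * ys[j][i];
        }
    }
    return out;
}

// Chunked variant: every (vector, chunk) pair plays the role of one small
// launch that reduces its slice to a partial sum; partials are combined after.
std::vector<double> mdot_chunked(const std::vector<double>& x,
                                 const std::vector<std::vector<double>>& ys,
                                 std::size_t chunk) {
    assert(chunk > 0);
    std::vector<double> out(ys.size(), 0.0);
    for (std::size_t j = 0; j < ys.size(); ++j) {
        assert(ys[j].size() == x.size());
        for (std::size_t begin = 0; begin < x.size(); begin += chunk) {
            const std::size_t end = std::min(begin + chunk, x.size());
            double partial = 0.0;  // result of one simulated "launch"
            for (std::size_t i = begin; i < end; ++i) {
                partial += x[i] * ys[j][i];
            }
            out[j] += partial;  // combine partial sums
        }
    }
    return out;
}
```

Both functions return the same dot products; they differ in how many
passes and "launches" are needed. On a GPU the chunked launches could be
issued asynchronously so they overlap, which this serial sketch does not
model.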

On Tue, Apr 24, 2012 at 2:21 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> On Tue, Apr 24, 2012 at 14:12, Daniel Lowell <redratio1 at gmail.com> wrote:
>
>> I'm writing a vector type which uses flag syncing like you have in PETSc
>> with Vec CUSP; however, it uses asynchronous kernel launches
>> (pipelining, etc.) and autotuned kernels. Not quite ready for primetime, but
>> we have seen the value of it in terms of speedup.
>
>
> Okay, but why do dozens of small kernel launches when all the data is
> available up-front? I'm just skeptical that VecMDot should be implemented
> for CUDA the way it currently is.
>



-- 
Daniel Lowell