<div dir="ltr">On Tue, Sep 24, 2013 at 7:07 AM, Karl Rupp <span dir="ltr"><<a href="mailto:rupp@mcs.anl.gov" target="_blank">rupp@mcs.anl.gov</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hey,<div><div class="h5"><br>

<br>

On 09/24/2013 03:53 PM, Jed Brown wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Karl Rupp <<a href="mailto:rupp@mcs.anl.gov" target="_blank">rupp@mcs.anl.gov</a>> writes:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I'm not talking about CSR vs. COO from the SpMV point of view, but<br>

rather on how to store the actual data in global memory without<br>

expensive subsequent sorts.<br>

</blockquote>

<br>

Sure, but this seems like such a minor detail.  With PetscScalar=double<br>

and PetscInt=int, we have 16 bytes/entry for COO and (nominally) 12<br>

bytes/entry for CSR, and it only needs to go to GPU global memory and<br>

back, not across to the CPU.  I doubt the difference between 12 and 16<br>

bytes/entry during assembly is a bottleneck.<br>

</blockquote>

<br></div></div>

I'm not worried about 12 bytes vs. 16 bytes, but rather about the ordering of entries as a whole. If one assembles into something CSR-like, then one can either run the SpMV right away, or merge entries in each row of the matrix which have the same column indices. Merging such entries can usually be done in shared memory, so the memory costs is one read and write of the matrix nonzero entries in worst case.<br>


<br>

On the contrary, if everything is assembled into a general COO format, then one needs to sort the triplets by row first in order to be even able to run SpMVs. The memory transactions required for this are<br>

O(N log(N)) with N being the number of nonzeros. N is in almost all cases larger than 10^6, so the log(N) hurts...<br></blockquote><div><br></div><div>Here I believe strongly that we need tests. Nathan assured me that nothing is faster on the GPU than sort+reduce-by-key since</div>

<div>they are highly optimized. I think they will be hard to beat, and the initial timings I had say that this is the case. I am willing to be</div><div>wrong, but I am not willing to overengineer based on supposition.</div>

<div><br></div><div>  Thanks,</div><div><br></div><div>       Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Best regards,<br>

Karli<br>

<br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener

</div></div>