On Tue, Mar 13, 2012 at 3:22 PM, Dave May <span dir="ltr">&lt;<a href="mailto:dave.mayhem23@gmail.com">dave.mayhem23@gmail.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hey Matt,<br>

  Do you have any guidance or ideas regarding how large the subdomains<br>

should be to offset the cost of this copy?<br></blockquote><div><br></div><div>They have to be substantial, but it depends on your arithmetic intensity. What I can say</div><div>for sure is that we maxed out the memory on the machines we have (like my laptop) and</div>

<div>saw a 3-5x speedup with SpMV.</div><div><br></div><div>Also, I saw a TON of overhead on my FEM benchmark, even though it is screaming on</div><div>the GPU, but I now think that comes from cudaMalloc()/cudaFree() rather than all from communication.</div>
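<div><br></div><div>One rough way to reason about the break-even point is a bandwidth-only cost model. This is a minimal sketch, not a measurement: the function names and every number below (CPU, GPU, and PCIe bandwidths, the fixed per-copy overhead) are assumptions chosen for illustration, and it treats each iteration as a purely memory-bound kernel such as SpMV.</div>

```python
import math


def gpu_total_time(n_bytes, iterations, gpu_bw, pcie_bw, fixed_overhead):
    """One host-to-device copy plus `iterations` bandwidth-bound kernels (seconds)."""
    copy = fixed_overhead + n_bytes / pcie_bw
    return copy + iterations * n_bytes / gpu_bw


def cpu_total_time(n_bytes, iterations, cpu_bw):
    """`iterations` bandwidth-bound kernels on the host, no copy (seconds)."""
    return iterations * n_bytes / cpu_bw


def break_even_iterations(n_bytes, cpu_bw, gpu_bw, pcie_bw, fixed_overhead):
    """Approximate iteration count at which the copy cost is paid back.

    Returns None when the GPU kernel is not faster than the CPU kernel.
    """
    saved_per_iter = n_bytes / cpu_bw - n_bytes / gpu_bw
    if saved_per_iter <= 0:
        return None
    copy = fixed_overhead + n_bytes / pcie_bw
    return math.ceil(copy / saved_per_iter)


# Hypothetical hardware: 10 GB/s CPU, 40 GB/s GPU (a 4x kernel speedup),
# 5 GB/s PCIe, and 1 ms of fixed overhead per copy (allocation, launch).
hw = dict(cpu_bw=10e9, gpu_bw=40e9, pcie_bw=5e9, fixed_overhead=1e-3)

for n_bytes in (1e6, 1e7, 1e8):
    print(int(n_bytes), break_even_iterations(n_bytes, **hw))
```

<div>Under these assumptions the fixed overhead dominates for small problems, so the number of iterations needed to pay back the copy falls as the problem grows. Real numbers should come from profiling on the actual machine.</div>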

<div><br></div><div>   Matt</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Cheers,<br>

  Dave<br>

<br>

<br>

On 13 March 2012 15:03, Matthew Knepley &lt;<a href="mailto:knepley@gmail.com">knepley@gmail.com</a>&gt; wrote:<br>

&gt; On Tue, Mar 13, 2012 at 8:59 AM, Xiangze Zeng &lt;<a href="mailto:zengshixiangze@163.com">zengshixiangze@163.com</a>&gt;<br>

&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; Hi, Jed.<br>

&gt;&gt; I added &quot;printf&quot; calls at the beginning and end of the code that sets the<br>

&gt;&gt; matrix values and computed the elapsed time. It is much longer than<br>

&gt;&gt; when I don&#39;t use the GPU. I am only guessing that the time is spent copying<br>

&gt;&gt; data. My PCTYPE is sor, and the solve takes 2000 iterations. Do you have any<br>

&gt;&gt; suggestions about this?<br>

&gt;<br>

&gt;<br>

&gt; 1) You do not have to guess. Use -log_summary, and there are explicit events<br>

&gt; for copying to the GPU<br>

&gt;<br>

&gt; 2) GPUs only really become effective for large systems due to this overhead.<br>

&gt; I suggest looking at the<br>

&gt;     performance and overhead as a function of system size.<br>

&gt;<br>

&gt;    Matt<br>

&gt;<br>

&gt;&gt;<br>

&gt;&gt; Zeng<br>

&gt;&gt;<br>

&gt;&gt; At 2012-03-13 20:12:09, &quot;Jed Brown&quot; &lt;<a href="mailto:jedbrown@mcs.anl.gov">jedbrown@mcs.anl.gov</a>&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; 2012/3/13 Xiangze Zeng &lt;<a href="mailto:zengshixiangze@163.com">zengshixiangze@163.com</a>&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; After configuring PETSc with --with-precision=single, I can run both<br>

&gt;&gt;&gt; ex19 and my own code. Good news! But it seems a lot of time is spent<br>

&gt;&gt;&gt; copying the data from the CPU to the GPU.<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; How are you measuring? What preconditioner are you using and how many<br>

&gt;&gt; iterations are typically required?<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;<br>

&gt;<br>

<span class="HOEnZb"><font color="#888888">&gt;<br>

&gt; --<br>

&gt; What most experimenters take for granted before they begin their experiments<br>

&gt; is infinitely more interesting than any results to which their experiments<br>

&gt; lead.<br>

&gt; -- Norbert Wiener<br>

</font></span></blockquote></div><br><br clear="all"><div><br></div>