<div class="gmail_quote">On Thu, Sep 29, 2011 at 07:44, Matthew Knepley <span dir="ltr"><<a href="mailto:knepley@gmail.com">knepley@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
> The way I read these numbers is that there is bandwidth for about 3 cores
> on this machine, and non-negligible synchronization penalty:
>
>              1 proc   2 proc   4 proc   8 proc
>  VecAXPBYCZ     496      857     1064     1070
>  VecDot         451      724     1089      736
>  MatMult        434      638      701      703
Matt, thanks for pulling out this summary. The synchronization cost of the
dot product is clearly significant here. I'm surprised it matters so much on
such a small problem, but it is a common obstacle to scalability.
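
The reason the dot product synchronizes while the AXPY-type operations do
not: a dot product needs a global reduction, so every process has to wait
for the slowest one, whereas an AXPY is purely local. A minimal sketch of
the two communication patterns in MPI C (function and variable names here
are mine for illustration, not PETSc's internals):

    #include <mpi.h>

    /* y += alpha*x: purely local, no communication, no waiting */
    static void axpy(double alpha, const double *x, double *y, int n)
    {
      for (int i = 0; i < n; i++) y[i] += alpha * x[i];
    }

    /* dot(x,y): local partial sum, then a blocking global reduction.
       The MPI_Allreduce is where every process waits for the slowest
       one; that wait is the synchronization penalty in the table. */
    static double dot(const double *x, const double *y, int n, MPI_Comm comm)
    {
      double local = 0.0, global;
      for (int i = 0; i < n; i++) local += x[i] * y[i];
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
      return global;
    }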
I think that MPI is putting one process per socket when you go from 1 to 2
processes. This gives you pretty good speedup despite the fact that memory
traffic for both sockets is routed through the "Blackford" chipset. Intel
fixed this in later generations by throwing out the notion of uniform memory
access and giving each socket its own independent memory banks (which AMD
already had at the time these chips came out).
Barring odd hardware issues (such as not having enough independent memory
streams) and imbalance, one process per socket can saturate the memory
bandwidth, so the speedup you get from 2 procs (1 per socket) to 4 procs
(2 per socket) is minimal. On this architecture, you typically see the
STREAM bandwidth go down slightly as you add more than 2 procs per socket,
so it's no surprise that it doesn't help here.
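
If you want to measure achievable bandwidth yourself, the core of the STREAM
triad kernel is just a few lines. Below is a stripped-down MPI sketch of the
idea, not the real benchmark (John McCalpin's STREAM does proper timing over
multiple repetitions, several kernels, and validation; the array size N here
is an assumption and must stay well above the combined cache sizes):

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define N 20000000  /* per-process array length, ~160 MB per array set */

    int main(int argc, char **argv)
    {
      double *a = malloc(N * sizeof(double));
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));
      double  t;
      int     i;

      MPI_Init(&argc, &argv);
      for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      MPI_Barrier(MPI_COMM_WORLD);
      t = MPI_Wtime();
      for (i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];  /* triad */
      MPI_Barrier(MPI_COMM_WORLD);
      t = MPI_Wtime() - t;

      /* 3 doubles (24 bytes) move per iteration on each process;
         printing a[N-1] keeps the compiler from eliding the loop */
      printf("per-process bandwidth: %.0f MB/s (check %g)\n",
             24.0 * N / t / 1e6, a[N-1]);

      free(a); free(b); free(c);
      MPI_Finalize();
      return 0;
    }

Run it with 1, 2, 4, and 8 processes and watch where the aggregate (sum over
ranks) stops growing; that plateau is the ceiling Matt's table is bumping into.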
Note that your naive 1D partition is getting worse as you add processes. The
MatMult should scale somewhat better if you use a 2D decomposition, as is
done by any of the examples using a DA (DMDA in 3.2) for grid management.
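
For reference, a minimal sketch of setting up such a 2D decomposition with
DMDA (the calls below follow the current PETSc API; the enum and creation
names differ slightly in 3.2, and the grid size is made up for illustration):

    #include <petscdmda.h>

    int main(int argc, char **argv)
    {
      DM  da;
      Mat A;

      PetscInitialize(&argc, &argv, NULL, NULL);
      /* 1000 x 1000 grid, 5-point star stencil, 1 dof per node.
         PETSC_DECIDE for the process grid lets PETSc pick a balanced
         2D layout instead of a 1D strip of slabs. */
      DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                   DMDA_STENCIL_STAR, 1000, 1000,
                   PETSC_DECIDE, PETSC_DECIDE, 1, 1, NULL, NULL, &da);
      DMSetUp(da);
      DMCreateMatrix(da, &A);  /* preallocated for this stencil */
      /* ... assemble A, then MatMult/KSPSolve as usual ... */
      MatDestroy(&A);
      DMDestroy(&da);
      PetscFinalize();
      return 0;
    }

On an n x n grid split over P processes, a 2D block decomposition
communicates O(n/sqrt(P)) ghost values per process instead of the O(n)
boundary of a 1D strip, so the communication-to-computation ratio improves
as you add processes.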
> The bandwidth tops out between 2 and 4 cores (the 5345 should have
> 10.6 GB/s, but you should run STREAM, as Barry says, to see what is
> achievable). There is obviously a penalty for VecDot relative to
> VecAXPBYCZ, which is the sync penalty, and it also seems to affect
> MatMult. Maybe Jed can explain that.