On Tue, Mar 20, 2012 at 08:55, Satish Balay <balay@mcs.anl.gov> wrote:

> Are you pinning the MPI jobs to specific cores for these tests? Does it
> make a difference?

It didn't seem to make a difference with 32 procs, but note that this example fits everything in cache. We should have a less contrived use case before putting more effort into it.

With 64 procs, binding seems more important.

jedbrown@cg:~/petsc/src/vec/vec/examples/tests$ ~/usr/mpich/bin/mpiexec -n 64 -binding rr ./ex42 -log_summary -splitreduction_async 0 | grep '^Vec'
Vector Object: 64 MPI processes
VecView                1 1.0 2.3429e-03 87.0 0.00e+00 0.0 6.3e+01 4.0e+01 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                 1 1.0 2.1935e-05  7.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyBegin       1 1.0 2.7320e-03  1.6 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00  5  0  0  0  3   5  0  0  0  3     0
VecAssemblyEnd         1 1.0 3.0994e-05 10.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin      300 1.0 7.4935e-04  1.4 0.00e+00 0.0 1.9e+04 4.0e+01 0.0e+00  1  0 98 99  0   1  0 98 99  0     0
VecScatterEnd        300 1.0 1.2340e-02  7.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 13  0  0  0  0  13  0  0  0  0     0
VecReduceArith       100 1.0 1.7476e-04  2.2 9.00e+02 1.0 0.0e+00 0.0e+00 0.0e+00  0 100  0  0  0   0 100  0  0  0   330
VecReduceComm        100 1.0 2.9184e-02  1.5 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+02 57  0  0  0 89  58  0  0  0 90     0

jedbrown@cg:~/petsc/src/vec/vec/examples/tests$ ~/usr/mpich/bin/mpiexec -n 64 -binding rr ./ex42 -log_summary -splitreduction_async 1 | grep '^Vec'
Vector Object: 64 MPI processes
VecView                1 1.0 1.7929e-03 94.0 0.00e+00 0.0 6.3e+01 4.0e+01 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecSet                 1 1.0 2.6941e-05  5.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyBegin       1 1.0 2.5909e-03  1.8 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+00 10  0  0  0 25  10  0  0  0 27     0
VecAssemblyEnd         1 1.0 2.4080e-05  4.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin      300 1.0 1.1673e-03  3.4 0.00e+00 0.0 1.9e+04 4.0e+01 0.0e+00  4  0 98 99  0   4  0 98 99  0     0
VecScatterEnd        300 1.0 2.0361e-03  3.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0
VecReduceArith       100 1.0 2.0885e-04  2.9 9.00e+02 1.0 0.0e+00 0.0e+00 0.0e+00  1 100  0  0  0   1 100  0  0  0   276
VecReduceBegin       100 1.0 6.0058e-04  2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
VecReduceEnd         100 1.0 2.7788e-03  1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 13  0  0  0  0  13  0  0  0  0     0
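
For reference, the VecReduceBegin/VecReduceEnd events in the second run come from PETSc's split-phase reductions, which is what -splitreduction_async toggles: when async is on, the Begin can post a nonblocking all-reduce and the End completes it, so the communication overlaps whatever runs in between. A minimal sketch of the usage pattern (vector size and values here are made up for illustration, not taken from ex42):

/* Sketch of PETSc split reductions: Begin starts the (possibly
   asynchronous) reduction, End completes it; independent local work
   can overlap the communication in between. */
#include <petscvec.h>

int main(int argc,char **argv)
{
  Vec            x,y;
  PetscScalar    dot;
  PetscReal      nrm;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,(char*)0,0);CHKERRQ(ierr);
  ierr = VecCreateMPI(PETSC_COMM_WORLD,PETSC_DECIDE,100,&x);CHKERRQ(ierr);
  ierr = VecDuplicate(x,&y);CHKERRQ(ierr);
  ierr = VecSet(x,1.0);CHKERRQ(ierr);
  ierr = VecSet(y,2.0);CHKERRQ(ierr);

  ierr = VecDotBegin(x,y,&dot);CHKERRQ(ierr);        /* start both reductions; */
  ierr = VecNormBegin(x,NORM_2,&nrm);CHKERRQ(ierr);  /* they batch into one message */
  /* ...independent local work could overlap the reduction here... */
  ierr = VecDotEnd(x,y,&dot);CHKERRQ(ierr);
  ierr = VecNormEnd(x,NORM_2,&nrm);CHKERRQ(ierr);

  ierr = PetscPrintf(PETSC_COMM_WORLD,"dot=%g norm=%g\n",(double)PetscRealPart(dot),(double)nrm);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}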

> I'm curious as this machine has asymmetric cores wrt L2/FPU.
> Presumably, using p0,p2,p4 etc. should spread out the load,
> but I don't know if the kernel is doing this automatically.
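
One way to test the p0,p2,p4 layout without trusting the kernel scheduler is to have each rank bind itself at startup. A hedged sketch, assuming Linux, all ranks on a single node, and a numbering in which logical cores 2k and 2k+1 share a module; the 2*rank mapping is an assumption, so check the topology with lstopo first:

/* Bind each MPI rank to an even-numbered core so that on a module-based
   CPU (two cores sharing an FPU/L2) every rank gets a module to itself.
   Assumes a single-node run; a multi-node job would need the node-local
   rank instead of the global one. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc,char **argv)
{
  int       rank;
  cpu_set_t set;

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);

  CPU_ZERO(&set);
  CPU_SET(2*rank,&set);              /* even cores only: 0,2,4,... */
  if (sched_setaffinity(0,sizeof(set),&set)) perror("sched_setaffinity");

  /* ...run the benchmark here and compare against the unbound timings... */
  printf("rank %d bound to core %d\n",rank,2*rank);
  MPI_Finalize();
  return 0;
}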