On Thu, Sep 29, 2011 at 11:28 AM, Matija Kecman <matijakecman@gmail.com> wrote:
> Thanks for your response Jed! I've been doing some other
> investigations using this example. I made some small modifications:
>
> 1. Added preallocation as Jed Brown suggested in a previous email
> (http://lists.mcs.anl.gov/pipermail/petsc-users/2011-June/009054.html).
> 2. Added a small VTK viewer.
> 3. Set the initial guess to zero.
> 4. Changed the entries in the element stiffness matrix to the following:
>
> Ke[ 0] = 2./3.; Ke[ 1] = -1./6.; Ke[ 2] = -1./3.; Ke[ 3] = -1./6.;
> Ke[ 4] = -1./6.; Ke[ 5] = 2./3.; Ke[ 6] = -1./6.; Ke[ 7] = -1./3.;
> Ke[ 8] = -1./3.; Ke[ 9] = -1./6.; Ke[10] = 2./3.; Ke[11] = -1./6.;
> Ke[12] = -1./6.; Ke[13] = -1./3.; Ke[14] = -1./6.; Ke[15] = 2./3.;
>
> I computed these by evaluating $K^e_{ij} = \int_{\Omega_e} \nabla
> \psi^e_i \cdot \nabla \psi^e_j \, \mathrm{d}\Omega$, where the
> $\psi^e_i$ are the shape functions of a bilinear quadrilateral finite
> element and $\Omega_e$ is the element domain. This is different from
> what was originally in the code, and I'm not sure where the original
> entries come from. It isn't important, so you can ignore it if you
> like; I get the same solution using both matrices.
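
For reference, here is a small standalone check of these entries (an
illustrative sketch, not part of the attached code): it assumes a square
bilinear quadrilateral (Q1) element with counterclockwise node ordering and
evaluates the integral with 2x2 Gauss quadrature; for the Laplacian the
element size cancels, so the reference square [-1,1]^2 is used directly.

/* illustrative check of the element stiffness entries above */
#include <stdio.h>
#include <math.h>

int main(void)
{
  const double g     = 1.0 / sqrt(3.0);   /* 2-point Gauss abscissa, weight 1 */
  const double gp[2] = {-g, g};
  double       Ke[4][4] = {{0.0}};
  int          q, r, i, j;

  for (q = 0; q < 2; q++) {
    for (r = 0; r < 2; r++) {
      const double xi = gp[q], eta = gp[r];
      /* derivatives of the four bilinear shape functions at (xi, eta) */
      const double dNdxi[4]  = {-(1 - eta) / 4,  (1 - eta) / 4, (1 + eta) / 4, -(1 + eta) / 4};
      const double dNdeta[4] = {-(1 - xi) / 4,  -(1 + xi) / 4,  (1 + xi) / 4,   (1 - xi) / 4};
      for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
          Ke[i][j] += dNdxi[i] * dNdxi[j] + dNdeta[i] * dNdeta[j];
    }
  }
  /* prints 2/3 on the diagonal, -1/6 for edge neighbours, -1/3 across the element */
  for (i = 0; i < 4; i++) {
    for (j = 0; j < 4; j++) printf("% 10.6f", Ke[i][j]);
    printf("\n");
  }
  return 0;
}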
>
> ---
>
> I am running on a cluster of two compute nodes, each with two
> quad-core Intel Xeon 5345 processors and 16 GB of memory. The nodes
> are connected by a Mellanox InfiniScale 2400 interconnect. I computed
> my results using a machine file which specifies that (up to) the
> first 8 processes run on node1 and the second group of 8 processes
> runs on node2. My timing results are shown in the table below; each
> test uses Bi-CGStab with no preconditioning (-ksp_type bcgs -pc_type
> none) on a computational grid of 800 x 800 cells, so 641601 DOFs. I
> have attached my modified source code (you could look at my changes
> using diff) and the -log_summary output for each of the tests.

The way I read these numbers is that there is bandwidth for about 3 cores on
this machine, and a non-negligible synchronization penalty:
             1 proc   2 proc   4 proc   8 proc
VecAXPBYCZ      496      857     1064     1070
VecDot          451      724     1089      736
MatMult         434      638      701      703
The bandwidth tops out between 2 and 4 cores (the 5345 should have 10.6 GB/s,
but you should run streams, as Barry says, to see what is achievable). There
is obviously a penalty for VecDot relative to VecAXPBYCZ, which is the
synchronization penalty, and it also seems to affect MatMult. Maybe Jed can
explain that.

  Matt
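
A rough back-of-envelope reading of the VecDot row above (a sketch, assuming
the MFlop/s figures are aggregate rates over all processes and that a dot
product streams two vectors, i.e. roughly 16 bytes of memory traffic for
every 2 flops, or 8 bytes/flop):

  1 process:    451 MFlop/s x 8 bytes/flop = ~3.6 GB/s
  4 processes: 1089 MFlop/s x 8 bytes/flop = ~8.7 GB/s

so four processes on one node already get close to the nominal 10.6 GB/s,
which is consistent with "bandwidth for about 3 cores".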
> # processes | time for KSPSolve() (s) | iterations to convergence | norm of error
>           1 |   64.008                |   692                     |   0.00433961
>           2 |   36.2767               |   626                     |   0.00611835
>           4 |   35.9989               |   760                     |   0.00311053
>           8 |   30.5215               |   664                     |   0.00599148
>          16 |   14.1164               |   710                     |   0.00792162
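
A quick per-iteration view of these timings (a back-of-envelope using the
numbers above; per-iteration cost is the fairer comparison since the
iteration counts differ between runs):

  1 process:    64.008 s / 692 iterations  = ~92.5 ms/iteration
  8 processes:  30.5215 s / 664 iterations = ~46.0 ms/iteration (~2.0x)
  16 processes: 14.1164 s / 710 iterations = ~19.9 ms/iteration (~4.7x)

so even per iteration the speedup within a single node saturates around 2x,
and adding the second node (with its own memory bandwidth) roughly doubles
it again.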
>
> Why is the scaling so poor? I have read the FAQ
> (http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers);
> am I experiencing the problem described there? I think my machine has
> a bandwidth of 2 GB/s per process, as suggested. Also, how can you
> tell if a computation is memory bound by looking at the -log_summary
> output?
>
> Many thanks,
>
> Matija
>
> On Tue, Sep 20, 2011 at 11:44 AM, Jed Brown <jedbrown@mcs.anl.gov> wrote:
>>
>> On Tue, Sep 20, 2011 at 11:45, Matija Kecman <matijakecman@gmail.com> wrote:
>>>
>>> $ mpirun -np 1 ./ex3 -ksp_type gmres -pc_type none -m 100
>>> Norm of error 0.570146 Iterations 0
>>
>> This uses a nonzero initial guess so the initial residual norm is compared to the right hand side.
>> $ ./ex3 -ksp_type gmres -ksp_monitor -m 100 -pc_type none -ksp_converged_reason -info |grep Converged
>> [0] KSPDefaultConverged(): user has provided nonzero initial guess, computing 2-norm of preconditioned RHS
>> [0] KSPDefaultConverged(): Linear solver has converged. Residual norm 1.113646413065e-04 is less than relative tolerance 1.000000000000e-05 times initial right hand side norm 1.291007358616e+01 at iteration 0
>> You can use the true residual, it just costs something so it's not enabled by default:
>> $ ./ex3 -ksp_type gmres -ksp_monitor -m 100 -pc_type none -ksp_converged_reason -ksp_converged_use_initial_residual_norm
>> [many iterations]
>> Linear solve converged due to CONVERGED_RTOL iterations 1393
>> Norm of error 0.000664957 Iterations 1393
--
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener