On Thu, Sep 29, 2011 at 11:28 AM, Matija Kecman <matijakecman@gmail.com> wrote:
> Thanks for your response Jed! I've been doing some other
> investigations using this example. I made some small modifications:
>
> 1. Added preallocation as Jed Brown suggested in a previous email
> (http://lists.mcs.anl.gov/pipermail/petsc-users/2011-June/009054.html).
> 2. Added a small VTK viewer.
> 3. Set the initial guess to zero.
> 4. Changed the entries in the element stiffness matrix to the following:
>
> Ke[ 0] = 2./3.; Ke[ 1] = -1./6.; Ke[ 2] = -1./3.; Ke[ 3] = -1./6.;
> Ke[ 4] = -1./6.; Ke[ 5] = 2./3.; Ke[ 6] = -1./6.; Ke[ 7] = -1./3.;
> Ke[ 8] = -1./3.; Ke[ 9] = -1./6.; Ke[10] = 2./3.; Ke[11] = -1./6.;
> Ke[12] = -1./6.; Ke[13] = -1./3.; Ke[14] = -1./6.; Ke[15] = 2./3.;
>
> I computed these by evaluating $K^e_{ij} = \int_{\Omega_e} \nabla
> \psi^e_i \cdot \nabla \psi^e_j \, \mathrm{d}\Omega$, where the
> $\psi^e_i$ are the shape functions of a bilinear quadrilateral finite
> element and $\Omega_e$ is the element domain. This is different from
> what was originally in the code, and I'm not sure where the original
> entries come from. It isn't important, so you can ignore it if you
> like; I get the same solution using both matrices.
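
For reference, here is a small standalone check of these entries (an
illustrative sketch, not part of the attached code): it assumes a square
bilinear quadrilateral (Q1) element with counterclockwise node ordering and
evaluates the integral with 2x2 Gauss quadrature; for the Laplacian the
element size cancels, so the reference square [-1,1]^2 is used directly.

/* illustrative check of the element stiffness entries above */
#include <stdio.h>
#include <math.h>

int main(void)
{
  const double g     = 1.0 / sqrt(3.0);   /* 2-point Gauss abscissa, weight 1 */
  const double gp[2] = {-g, g};
  double       Ke[4][4] = {{0.0}};
  int          q, r, i, j;

  for (q = 0; q < 2; q++) {
    for (r = 0; r < 2; r++) {
      const double xi = gp[q], eta = gp[r];
      /* derivatives of the four bilinear shape functions at (xi, eta) */
      const double dNdxi[4]  = {-(1 - eta) / 4,  (1 - eta) / 4, (1 + eta) / 4, -(1 + eta) / 4};
      const double dNdeta[4] = {-(1 - xi) / 4,  -(1 + xi) / 4,  (1 + xi) / 4,   (1 - xi) / 4};
      for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
          Ke[i][j] += dNdxi[i] * dNdxi[j] + dNdeta[i] * dNdeta[j];
    }
  }
  /* prints 2/3 on the diagonal, -1/6 for edge neighbours, -1/3 across the element */
  for (i = 0; i < 4; i++) {
    for (j = 0; j < 4; j++) printf("% 10.6f", Ke[i][j]);
    printf("\n");
  }
  return 0;
}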
>
> ---
>
> I am running on a cluster of two compute nodes, each with two
> quad-core Intel Xeon 5345 processors and 16 GB of memory. The nodes
> are connected by a Mellanox InfiniScale 2400 interconnect. I computed
> my results using a machine file which specifies that (up to) the
> first 8 processes run on node1 and the second group of 8 processes
> runs on node2. My timing results are shown in the table below; each
> test uses Bi-CGStab with no preconditioning (-ksp_type bcgs -pc_type
> none) on a computational grid of 800 x 800 cells, so 641601 DOFs. I
> have attached my modified source code (you could look at my changes
> using diff) and the -log_summary output for each of the tests.

The way I read these numbers is that there is bandwidth for about 3 cores on
this machine, and a non-negligible synchronization penalty:
             1 proc   2 proc   4 proc   8 proc
VecAXPBYCZ      496      857     1064     1070
VecDot          451      724     1089      736
MatMult         434      638      701      703
The bandwidth tops out between 2 and 4 cores (the 5345 should have 10.6 GB/s,
but you should run streams, as Barry says, to see what is achievable). There
is obviously a penalty for VecDot relative to VecAXPBYCZ, which is the
synchronization penalty, and it also seems to affect MatMult. Maybe Jed can
explain that.

  Matt
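
A rough back-of-envelope reading of the VecDot row above (a sketch, assuming
the MFlop/s figures are aggregate rates over all processes and that a dot
product streams two vectors, i.e. roughly 16 bytes of memory traffic for
every 2 flops, or 8 bytes/flop):

  1 process:    451 MFlop/s x 8 bytes/flop = ~3.6 GB/s
  4 processes: 1089 MFlop/s x 8 bytes/flop = ~8.7 GB/s

so four processes on one node already get close to the nominal 10.6 GB/s,
which is consistent with "bandwidth for about 3 cores".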
> # processes | time for KSPSolve() (s) | iterations to convergence | norm of error
>           1 |   64.008                |   692                     |   0.00433961
>           2 |   36.2767               |   626                     |   0.00611835
>           4 |   35.9989               |   760                     |   0.00311053
>           8 |   30.5215               |   664                     |   0.00599148
>          16 |   14.1164               |   710                     |   0.00792162
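
A quick per-iteration view of these timings (a back-of-envelope using the
numbers above; per-iteration cost is the fairer comparison since the
iteration counts differ between runs):

  1 process:    64.008 s / 692 iterations  = ~92.5 ms/iteration
  8 processes:  30.5215 s / 664 iterations = ~46.0 ms/iteration (~2.0x)
  16 processes: 14.1164 s / 710 iterations = ~19.9 ms/iteration (~4.7x)

so even per iteration the speedup within a single node saturates around 2x,
and adding the second node (with its own memory bandwidth) roughly doubles
it again.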
>
> Why is the scaling so poor? I have read the FAQ
> (http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers);
> am I experiencing the problem described there? I think my machine has
> a bandwidth of 2 GB/s per process, as suggested. Also, how can you
> tell if a computation is memory bound by looking at the -log_summary
> output?
>
> Many thanks,
>
> Matija
>
> On Tue, Sep 20, 2011 at 11:44 AM, Jed Brown <jedbrown@mcs.anl.gov> wrote:
>>
>> On Tue, Sep 20, 2011 at 11:45, Matija Kecman <matijakecman@gmail.com> wrote:
>>>
>>> $ mpirun -np 1 ./ex3 -ksp_type gmres -pc_type none -m 100
>>> Norm of error 0.570146 Iterations 0
>>
>> This uses a nonzero initial guess so the initial residual norm is compared to the right hand side.
>> $ ./ex3 -ksp_type gmres -ksp_monitor -m 100 -pc_type none -ksp_converged_reason -info |grep Converged
>> [0] KSPDefaultConverged(): user has provided nonzero initial guess, computing 2-norm of preconditioned RHS
>> [0] KSPDefaultConverged(): Linear solver has converged. Residual norm 1.113646413065e-04 is less than relative tolerance 1.000000000000e-05 times initial right hand side norm 1.291007358616e+01 at iteration 0
>> You can use the true residual, it just costs something so it's not enabled by default:
>> $ ./ex3 -ksp_type gmres -ksp_monitor -m 100 -pc_type none -ksp_converged_reason -ksp_converged_use_initial_residual_norm
>> [many iterations]
>> Linear solve converged due to CONVERGED_RTOL iterations 1393
>> Norm of error 0.000664957 Iterations 1393
--
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener