Slow speed after changing from serial to parallel (with ex2f.F)
Satish Balay
balay at mcs.anl.gov
Wed Apr 16 00:25:45 CDT 2008
On Wed, 16 Apr 2008, Ben Tay wrote:
> Hi Satish, thank you very much for helping me run the ex2f.F code.
>
> I think I've a clearer picture now. I believe I'm running on Dual-Core Intel
> Xeon 5160. The quad core is only on atlas3-01 to 04 and there's only 4 of
> them. I guess that the lower peak is because I'm using Xeon 5160, while you
> are using Xeon X5355.
I'm still a bit puzzled. I just ran the same binary on a machine with
two dual-core Xeon 5130s [which should be similar to your 5160 machine]
and got the following:
[balay at n001 ~]$ grep MatMult log*
log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11 0 0 0 14 11 0 0 0 364
log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 615
log.4:MatMult 969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 14 11100100 0 14 11100100 0 656
[balay at n001 ~]$
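[In case you want to regenerate logs like these on your side: each log.N
above is just ex2f run with -log_summary at a given process count,
redirected to a file. The -m/-n problem size below is only an example -
use whatever size you've been testing with:

  mpiexec -n 1 ./ex2f -m 600 -n 600 -log_summary > log.1
  mpiexec -n 2 ./ex2f -m 600 -n 600 -log_summary > log.2
  mpiexec -n 4 ./ex2f -m 600 -n 600 -log_summary > log.4
]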
> You mention about the speedups for MatMult and compare between KSPSolve. Are
> these the only things we have to look at? Because I see that some other event
> such as VecMAXPY also takes up a sizable % of the time. To get an accurate
> speedup, do I just compare the time taken by KSPSolve between different no. of
> processors or do I have to look at other events such as MatMult as well?
Sometimes we look at individual components like MatMult() and VecMAXPY()
to understand what's happening in each stage - and at KSPSolve() to
look at the aggregate performance for the whole solve [which includes
MatMult, VecMAXPY, etc.]. Perhaps I should have looked at VecMDot()
as well - at 48% of runtime, it's the biggest contributor to
KSPSolve() for your run.
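A quick way to compare those individual events across runs is to grep
them all out of each log - with the same column layout as the MatMult
lines above, the 4th field is the max time for that event:

  grep -E 'KSPSolve|MatMult|VecMDot|VecMAXPY' log.1 log.2 log.4

and then look at how each event's time changes from log.1 to log.2 to log.4.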
It's easy to get lost in the details of log_summary. Looking for
anomalies is one thing; plotting scalability charts for the solver is
something else.
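For the scalability chart itself you mostly need one number per run -
say the max KSPSolve time - and then speedup = T(1 proc)/T(n proc).
A rough sketch, assuming the same log.1/log.2/log.4 files and that the
4th field is the max time, as in the lines above:

  t1=$(awk '/^KSPSolve/{print $4; exit}' log.1)
  for p in 2 4; do
    tp=$(awk '/^KSPSolve/{print $4; exit}' log.$p)
    awk -v p=$p -v t1=$t1 -v tp=$tp 'BEGIN{printf "np=%d  speedup=%.2f\n", p, t1/tp}'
  done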
> In summary, due to load imbalance, my speedup is quite bad. So maybe I'll just
> send your results to my school's engineer and see if they could do anything.
> For my part, I guess I'll just 've to wait?
Yes - the load imbalance at the MatMult level is bad. On the 4-proc run
you have ratio = 3.6. This means one of the MPI tasks is 3.6 times
slower than the fastest one [so all the speedup is lost here - the fast
tasks just sit waiting for the slow one].
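You can read this straight off your own log - with the column layout of
the MatMult lines above, the 5th field is the max/min time ratio across
processes:

  awk '/^MatMult /{print "MatMult time ratio (max/min) =", $5}' log.4

In my runs that ratio is 1.0. With 3.6, even if each process got a
quarter of the work, the slowest one alone takes roughly 3.6 * (1/4) ~ 90%
of the serial MatMult time - which is why the speedup disappears.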
You could try the latest mpich2 [1.0.7] - just for this SMP
experiment, and see if it makes a difference. I've built mpich2 with
[default gcc/gfortran and]:
./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker
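[If you want to try that build yourself, a rough sequence - the install
prefix and the PETSc configure options below are just placeholders for
your setup:

  ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker \
      --prefix=$HOME/soft/mpich2-1.0.7
  make && make install
  # then rebuild PETSc pointing at this MPI, e.g.
  ./config/configure.py --with-mpi-dir=$HOME/soft/mpich2-1.0.7 <your usual PETSc options>
]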
There could be something else going on on this machine that's messing
up load balance for this basic PETSc example.
Satish