PETSc runs slower on a shared memory machine than on a cluster
Satish Balay
balay at mcs.anl.gov
Sat Feb 3 19:00:04 CST 2007
On Sat, 3 Feb 2007, Shi Jin wrote:
> I do see that the cluster run is faster than the shared-memory
> case. However, I am not sure how I can tell the reason for this
> behavior is due to the memory subsystem. I don't know what evidence
> in the log to look for.
There were too many linewraps in the e-mailed text. It's best to send
such text as an attachment so that the formatting is preserved [and
readable].
------------------------------------------------------------------------------------------------------------------------
Event             Count      Time (sec)      Flops/sec                           --- Global ---      --- Stage ---    Total
                  Max  Ratio Max        Ratio Max      Ratio Mess    Avg len Reduct %T %F %M %L %R   %T %F  %M  %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

<cluster>
VecScatterBegin 137829  1.0  2.1816e+01  1.9  0.00e+00  0.0  8.3e+05 3.4e+04 0.0e+00  0  0 91 74  0   1  0 100 100  0       0
VecScatterEnd   137730  1.0  3.0302e+01  1.6  0.00e+00  0.0  0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0   0   0  0       0
MatMult         137730  1.0  3.5652e+02  1.3  2.58e+08  1.2  8.3e+05 3.4e+04 0.0e+00  9 15 91 74  0  19 21 100 100  0     815

<SMP>
VecScatterBegin 137829  1.0  1.4310e+01  2.2  0.00e+00  0.0  8.3e+05 3.4e+04 0.0e+00  0  0 91 74  0   0  0 100 100  0       0
VecScatterEnd   137730  1.0  1.5035e+02  6.5  0.00e+00  0.0  0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   3  0   0   0  0       0
MatMult         137730  1.0  5.4179e+02  1.5  1.99e+08  1.4  8.3e+05 3.4e+04 0.0e+00 11 15 91 74  0  21 21 100 100  0     536
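[For reference: tables like the one above come from running a PETSc
program with the -log_summary option (in current PETSc releases the
option is -log_view). Below is a minimal sketch - not from the
original thread - of a program whose MatMult calls would show up as
events like those above. It is written against the current PETSc API;
the matrix contents and sizes are purely illustrative.]

/* mult.c: run with e.g.  mpiexec -n 4 ./mult -log_summary */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  Vec         x, y;
  PetscInt    i, Istart, Iend, n = 100000;
  PetscScalar two = 2.0;

  PetscInitialize(&argc, &argv, NULL, NULL);
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &Istart, &Iend);
  for (i = Istart; i < Iend; i++)           /* a diagonal matrix, just to */
    MatSetValue(A, i, i, two, INSERT_VALUES);  /* have something to multiply */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreateVecs(A, &x, &y);
  VecSet(x, 1.0);
  for (i = 0; i < 100; i++)                 /* repeated MatMult: the event timed above */
    MatMult(A, x, y);

  VecDestroy(&x);
  VecDestroy(&y);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}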
Just looking at the time [in seconds] for VecScatterBegin(),
VecScatterEnd(), and MatMult() [the 4th column in the table], we
have:
[time in seconds]    Cluster    SMP
VecScatterBegin           21     14
VecScatterEnd             30    150
MatMult                  356    541
-----------------------------------
And MatMult is basically some local computation + communication [which
is the scatter time]. So if you consider just the local computation
time - and not the communication time - it's '356 - (21+30)' on the
cluster and '541 - (14+150)' on the SMP box.
-----------------------------------
                     Cluster    SMP
Communication cost        51    164
MatMult - (comm)         305    377
Considering this info - we can conclude the following:
** The communication cost on the SMP box [164 seconds] is a lot
higher than the communication cost on the cluster [51 seconds]. Part of
the issue here is the load balance between the procs. [This is shown
by the 5th column in the table - the max/min time ratio across procs.]
[load balance ratio]    Cluster    SMP
VecScatterBegin             1.9    2.2
VecScatterEnd               1.6    6.5
MatMult                     1.3    1.5
Somehow things are less balanced on the SMP box than on the cluster:
some procs run slower than others, and the faster procs then sit
waiting in the scatter - which shows up as the higher communication
cost on the SMP box.
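[What that ratio means, sketched with plain MPI - this is not PETSc
code, just an illustration: each process times the same region, and
the max/min of those times across processes is the load-balance ratio
reported in the log.]

/* ratio.c: compute a max/min load-balance ratio for a timed region */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  double t0, tlocal, tmax, tmin;
  int    rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  t0 = MPI_Wtime();
  /* ... the region being timed: local mat-vec, scatter, etc. ... */
  tlocal = MPI_Wtime() - t0;

  MPI_Reduce(&tlocal, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  MPI_Reduce(&tlocal, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
  if (rank == 0 && tmin > 0.0)
    printf("load-balance ratio [max/min] = %g\n", tmax / tmin);

  MPI_Finalize();
  return 0;
}

[A ratio near 1 means all procs spend about the same time in the
region; a ratio of 6.5 means the slowest proc takes 6.5x as long as
the fastest.]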
** The numerical part of MatMult is faster on the cluster [305
seconds] than on the SMP box [377 seconds]. This is very likely due
to memory bandwidth limitations: on the SMP box all the procs share
one memory subsystem, while on the cluster each node has its own.
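[One way to check the bandwidth hypothesis - a hedged sketch, not
from the original thread: run a STREAM-triad-style loop on 1, 2, 4,
... procs per node and watch the per-proc bandwidth drop as the procs
saturate the shared memory bus. Sizes and names here are illustrative.]

/* triad.c: rough memory bandwidth of the slowest proc, STREAM-triad style */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000               /* large enough to defeat the caches */

int main(int argc, char **argv)
{
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  double  t, tmax;
  int     i, rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

  MPI_Barrier(MPI_COMM_WORLD);
  t = MPI_Wtime();
  for (i = 0; i < N; i++)
    a[i] = b[i] + 3.0 * c[i];    /* triad: ~24 bytes of traffic per iteration */
  t = MPI_Wtime() - t;

  if (a[N/2] != 7.0)             /* keep the compiler from eliding the loop */
    printf("unexpected result\n");
  MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("slowest proc: %.2f MB/s\n", 24.0 * N / tmax / 1e6);

  free(a); free(b); free(c);
  MPI_Finalize();
  return 0;
}

[Sparse MatMult streams through the matrix entries plus their column
indices once per multiply, so - like the triad - it is limited by
sustainable memory bandwidth, not by the CPU's peak flop rate.]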
So both the computation and communication times are better on the
cluster [for MatMult - which is an essential kernel in a sparse
matrix solve].
Satish