PETSc runs slower on a shared memory machine than on a cluster

Satish Balay balay at mcs.anl.gov
Sat Feb 3 19:00:04 CST 2007


On Sat, 3 Feb 2007, Shi Jin wrote:

> I do see that the cluster run is faster than the shared-memory
> case. However, I am not sure how I can tell the reason for this
> behavior is due to the memory subsystem. I don't know what evidence
> in the log to look for.

There were too many linewraps in the e-mailed text. It's best to send
such text as an attachment so that the formatting is preserved [and
readable].

Event                Count      Time (sec)    Flops/sec                         --- Global ---  ---Stage ---   Total
                   Max Ratio  Max     Ratio   Max Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
<cluster>
VecScatterBegin   137829 1.0 2.1816e+01 1.9 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00  0  0 91 74  0   1  0 100 100  0     0
VecScatterEnd     137730 1.0 3.0302e+01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0   0   0  0     0
MatMult           137730 1.0 3.5652e+02 1.3 2.58e+08 1.2 8.3e+05 3.4e+04 0.0e+00  9 15 91 74  0  19 21 100 100  0   815
<SMP>
VecScatterBegin   137829 1.0 1.4310e+01 2.2 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00  0  0 91 74  0   0  0 100 100  0     0
VecScatterEnd     137730 1.0 1.5035e+02 6.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   3  0   0   0  0     0
MatMult           137730 1.0 5.4179e+02 1.5 1.99e+08 1.4 8.3e+05 3.4e+04 0.0e+00 11 15 91 74  0  21 21 100 100  0   536
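
[For reference: PETSc prints a table like this at the end of a run
when the application is started with the -log_summary option, e.g.
'mpiexec -np 4 ./app -log_summary' - the app name and proc count here
are just placeholders.]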

Just looking at the times [in seconds] for VecScatterBegin(),
VecScatterEnd(), and MatMult() [the 4th column in the table]
we have:

[time in seconds]
                    Cluster     SMP 
VecScatterBegin      21         14
VecScatterEnd        30        150
MatMult             356        541
-----------------------------------

MatMult is basically some local computation + communication [which
is the scatter time]. So if you consider just the local computation
time - and not the communication time - it's '356 - (21+30)' on the
cluster and '541 - (14+150)' on the SMP box [redone in the snippet
after the table]:

-----------------------------------
                    Cluster     SMP
Communication cost     51       164
MatMult - (comm)      305       377
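
If you want to redo this arithmetic yourself, here is a small
standalone C snippet [the numbers are hard-coded from the tables
above - nothing PETSc-specific in it]:

  #include <stdio.h>

  int main(void)
  {
    /* times in seconds from the -log_summary tables above */
    double scatter_begin[2] = {21, 14};    /* cluster, SMP */
    double scatter_end[2]   = {30, 150};
    double matmult[2]       = {356, 541};
    const char *name[2]     = {"Cluster", "SMP"};
    double comm, local;
    int i;

    for (i = 0; i < 2; i++) {
      comm  = scatter_begin[i] + scatter_end[i]; /* VecScatterBegin + VecScatterEnd */
      local = matmult[i] - comm;                 /* local computation only */
      printf("%-8s comm = %3.0f s, MatMult - comm = %3.0f s\n",
             name[i], comm, local);
    }
    return 0;
  }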

Considering this info - we can conclude the following:

** the communication cost on the SMP box [164 seconds] is a lot
higher than the communication cost on the cluster [51 seconds]. Part
of the issue here is the load balance across the procs. [This is
shown by the ratio - the 5th column in the table]

[load balance ratio]
                    Cluster      SMP
VecScatterBegin       1.9        2.2
VecScatterEnd         1.6        6.5
MatMult               1.3        1.5

Things are more balanced on the cluster than on the SMP box: on the
SMP box some procs run slower than others, so the faster procs end
up waiting in the scatter - resulting in the higher communication
cost there.
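
[The ratio column is essentially the max time over the min time
across the procs for that event. A rough standalone sketch of the
idea with raw MPI - just an illustration with deliberately uneven
dummy work, not PETSc's actual logging code:]

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
    volatile double s = 0.0;  /* volatile so the dummy loop isn't optimized away */
    double t0, local, tmax, tmin;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    /* deliberately uneven dummy work, so the ratio comes out > 1 */
    for (i = 0; i < 1000000 * (rank + 1); i++) s += 1e-9 * i;
    local = MPI_Wtime() - t0;

    /* the "Ratio" column is max/min of the per-proc times */
    MPI_Reduce(&local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("ratio = %.2f\n", tmax / (tmin > 0 ? tmin : 1e-12));

    MPI_Finalize();
    return 0;
  }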

** The numerical part of MatMult is faster on the cluster [305
seconds] than on the SMP box [377 seconds]. This is very likely due
to memory bandwidth limits - on the SMP box all procs compete for
the same memory subsystem, while on the cluster each node has its
own.
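
One way to check the bandwidth hypothesis is to run a STREAM-style
triad with one copy per proc on both machines: if the memory bus is
the bottleneck, the per-proc rate on the SMP box will drop as you
add procs, while on the cluster it should stay roughly flat. A rough
standalone sketch [array size etc. are arbitrary]:

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  #define N 10000000   /* 80 MB per array - big enough to fall out of cache */

  int main(int argc, char **argv)
  {
    double *a, *b, *c, t, rate;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    a = malloc(N * sizeof(double));
    b = malloc(N * sizeof(double));
    c = malloc(N * sizeof(double));
    if (!a || !b || !c) MPI_Abort(MPI_COMM_WORLD, 1);

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    MPI_Barrier(MPI_COMM_WORLD);   /* start all procs together */
    t = MPI_Wtime();
    for (i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];   /* triad */
    t = MPI_Wtime() - t;

    /* 3 arrays of 8-byte doubles move through memory */
    rate = 3.0 * N * sizeof(double) / t / 1.0e6;
    printf("[%d] triad rate: %.0f MB/s (a[0]=%g)\n", rank, rate, a[0]);

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
  }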


So both the computation and the communication times are better on the
cluster [for MatMult - which is an essential kernel in the sparse
matrix solve].

Satish
