PETSc runs slower on a shared memory machine than on a cluster
Shi Jin
jinzishuai at yahoo.com
Sat Feb 3 15:46:29 CST 2007
Thank you.
I did the same runs again with -log_summary. Here is
the part that I think is most important.
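(For completeness: both runs use the same number of MPI processes, launched along
the lines of "mpiexec -n <nproc> ./fem -log_summary", where ./fem only stands in
for our executable.)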
On the cluster:
--- Event Stage 5: Projection
Event                Count      Time (sec)     Flops/sec                          --- Global ---   --- Stage ---    Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct   %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------
[x]rhsLu                99 1.0 2.3875e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.9e+01  7  0  0  0  0  14  0   0   0  0      0
VecMDot             133334 1.0 4.1386e+02 1.6 3.43e+08 1.6 0.0e+00 0.0e+00 1.3e+05 10 18  0  0 45  21 27   0   0 49    883
VecNorm             137829 1.0 6.9839e+01 1.5 1.27e+08 1.5 0.0e+00 0.0e+00 1.4e+05  2  1  0  0 46   4  2   0   0 51    350
VecScale            137928 1.0 5.5639e+00 1.1 5.79e+08 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1   0   0  0   2197
VecCopy               4495 1.0 8.4510e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0      0
VecSet              142522 1.0 1.7712e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0   0   0  0      0
VecAXPY               8990 1.0 9.9013e-01 1.1 4.34e+08 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0   1610
VecMAXPY            137829 1.0 2.1687e+02 1.1 4.92e+08 1.1 0.0e+00 0.0e+00 0.0e+00  6 20  0  0  0  12 29   0   0  0   1793
VecScatterBegin     137829 1.0 2.1816e+01 1.9 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00  0  0 91 74  0   1  0 100 100  0      0
VecScatterEnd       137730 1.0 3.0302e+01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0   0   0  0      0
VecNormalize        137829 1.0 7.6565e+01 1.4 1.68e+08 1.4 0.0e+00 0.0e+00 1.4e+05  2  2  0  0 46   4  3   0   0 51    479
MatMult             137730 1.0 3.5652e+02 1.3 2.58e+08 1.2 8.3e+05 3.4e+04 0.0e+00  9 15 91 74  0  19 21 100 100  0    815
MatSolve            137829 1.0 5.0916e+02 1.2 1.56e+08 1.2 0.0e+00 0.0e+00 0.0e+00 13 14  0  0  0  28 20   0   0  0    531
MatGetRow         44110737 1.0 1.1846e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   7  0   0   0  0      0
KSPGMRESOrthog      133334 1.0 6.0430e+02 1.3 3.87e+08 1.3 0.0e+00 0.0e+00 1.3e+05 15 37  0  0 45  32 54   0   0 49   1209
KSPSolve                99 1.0 1.4336e+03 1.0 2.37e+08 1.0 8.3e+05 3.4e+04 2.7e+05 40 68 91 74 91  86 100 100 100 100  944
PCSetUpOnBlocks         99 1.0 3.2687e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0      0
PCApply             137829 1.0 5.3316e+02 1.2 1.50e+08 1.2 0.0e+00 0.0e+00 0.0e+00 14 14  0  0  0  30 20   0   0  0    507
---------------------------------------------------
On the shared-memory machine:
--- Event Stage 5: Projection
Event                Count      Time (sec)     Flops/sec                          --- Global ---   --- Stage ---    Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct   %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------
[x]rhsLu                99 1.0 2.0673e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.9e+01  5  0  0  0  0   9  0   0   0  0      0
VecMDot             133334 1.0 7.0932e+02 2.1 2.70e+08 2.1 0.0e+00 0.0e+00 1.3e+05 11 18  0  0 45  22 27   0   0 49    515
VecNorm             137829 1.0 1.2860e+02 7.0 3.32e+08 7.0 0.0e+00 0.0e+00 1.4e+05  2  1  0  0 46   3  2   0   0 51    190
VecScale            137928 1.0 5.0018e+00 1.0 6.36e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1   0   0  0   2444
VecCopy               4495 1.0 1.4161e+00 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0      0
VecSet              142522 1.0 1.9602e+01 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0   0   0  0      0
VecAXPY               8990 1.0 1.5128e+00 1.4 3.67e+08 1.4 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0   1054
VecMAXPY            137829 1.0 3.5204e+02 1.4 3.82e+08 1.4 0.0e+00 0.0e+00 0.0e+00  7 20  0  0  0  13 29   0   0  0   1105
VecScatterBegin     137829 1.0 1.4310e+01 2.2 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00  0  0 91 74  0   0  0 100 100  0      0
VecScatterEnd       137730 1.0 1.5035e+02 6.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   3  0   0   0  0      0
VecNormalize        137829 1.0 1.3453e+02 5.6 3.80e+08 5.6 0.0e+00 0.0e+00 1.4e+05  2  2  0  0 46   3  3   0   0 51    272
MatMult             137730 1.0 5.4179e+02 1.5 1.99e+08 1.4 8.3e+05 3.4e+04 0.0e+00 11 15 91 74  0  21 21 100 100  0    536
MatSolve            137829 1.0 7.9682e+02 1.4 1.18e+08 1.4 0.0e+00 0.0e+00 0.0e+00 16 14  0  0  0  30 20   0   0  0    339
MatGetRow         44110737 1.0 1.0296e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   5  0   0   0  0      0
KSPGMRESOrthog      133334 1.0 9.4927e+02 1.4 2.75e+08 1.4 0.0e+00 0.0e+00 1.3e+05 18 37  0  0 45  34 54   0   0 49    770
KSPSolve                99 1.0 2.0562e+03 1.0 1.65e+08 1.0 8.3e+05 3.4e+04 2.7e+05 47 68 91 74 91  91 100 100 100 100  658
PCSetUpOnBlocks         99 1.0 3.3998e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0  0      0
PCApply             137829 1.0 8.2326e+02 1.4 1.14e+08 1.4 0.0e+00 0.0e+00 0.0e+00 16 14  0  0  0  31 20   0   0  0    328
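Putting a few of the per-process rates from the two runs side by side (numbers
taken from the Mflop/s column above):

                 cluster   shared memory   (Mflop/s)
  MatMult            815             536
  MatSolve           531             339
  VecMDot            883             515
  VecMAXPY          1793            1105
  KSPSolve           944             658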
I do see that the cluster run is faster than the shared-memory run. However,
I am not sure how to tell from these numbers that the memory subsystem is the
reason for the difference; I don't know what evidence to look for in the log.
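Would a simple memory-bandwidth test, run with different numbers of processes
on each machine, be a reasonable way to check the per-process bandwidth? I am
thinking of something like the sketch below (the array size, repetition count,
and file name are only my guesses at a sensible test, not taken from our code):

/* bw.c: rough triad-style memory bandwidth check per MPI process.
 * Everything here (array size, repetitions, file name) is an assumption,
 * not part of the finite element code under discussion. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N    10000000   /* 10M doubles per array, ~240 MB per process in total */
#define REPS 10

int main(int argc, char **argv)
{
  int     rank, size, r;
  long    i;
  double *a, *b, *c, t0, t1, bytes, rate, minrate;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  a = (double *) malloc(N * sizeof(double));
  b = (double *) malloc(N * sizeof(double));
  c = (double *) malloc(N * sizeof(double));
  for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (r = 0; r < REPS; r++)
    for (i = 0; i < N; i++)
      a[i] = b[i] + 3.0 * c[i];         /* triad: 2 loads + 1 store per element */
  t1 = MPI_Wtime();

  bytes = 3.0 * sizeof(double) * (double) N * REPS;  /* bytes streamed by this process */
  rate  = bytes / (t1 - t0) / 1.0e6;                 /* MB/s for this process */

  /* the slowest process is what limits the overall run */
  MPI_Reduce(&rate, &minrate, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("%d processes: ~%.0f MB/s per process (slowest), check value %g\n",
           size, minrate, a[0]);

  free(a); free(b); free(c);
  MPI_Finalize();
  return 0;
}

I imagine compiling this with mpicc and running it with 1, 4, 8, and 16
processes on each machine would show whether the per-process rate drops much
more sharply on the shared-memory box than on the cluster.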
Thanks again.
Shi
--- Satish Balay <balay at mcs.anl.gov> wrote:
> There are 2 aspects to performance:
>
> - MPI performance [while message passing]
> - sequential performance for the numerical stuff.
>
> So it could be that the SMP box has better MPI performance. This can
> be verified with -log_summary from both the runs [and looking at
> VecScatter times].
>
> However, with the sequential numerical code it primarily depends
> upon the bandwidth between the CPU and the memory. On the SMP box,
> depending upon how the memory subsystem is designed, the effective
> memory bandwidth per CPU could be a small fraction of the peak memory
> bandwidth [when all CPUs are used].
>
> So you'll have to look at the memory subsystem design of each of these
> machines and compare the memory bandwidth per CPU. The performance
> from -log_summary, for example in MatMult, will reflect this
> [including the above communication overhead].
>
> Satish
>
> On Fri, 2 Feb 2007, Shi Jin wrote:
>
> > Hi there,
> >
> > I am fairly new to PETSc but have 5 years of MPI
> > programming experience already. I recently took on a project
> > of analyzing a finite element code written in C with PETSc.
> > I found out that on a shared-memory machine (60GB RAM,
> > 16 CPUs), the code runs around 4 times slower than
> > on a distributed-memory cluster (4GB RAM, 4 CPUs/node),
> > although they yield identical results.
> > There are 1.6 million finite elements in the problem, so
> > it is a fairly large calculation. The total memory
> > used is 3GB x 16 = 48GB.
> >
> > Both systems run Linux, and the same code is compiled
> > against the same version of MPICH-2 and PETSc.
> >
> > The shared-memory machine is actually a little faster
> > than the cluster machines in terms of single-process runs.
> >
> > I am surprised by this result, since we usually tend to
> > think that shared memory would be much faster because
> > in-memory operations are much faster than network
> > communication.
> >
> > However, I read the PETSc FAQ and found that "the
> > speed of sparse matrix computations is almost totally
> > determined by the speed of the memory, not the speed
> > of the CPU".
> > This makes me wonder whether the poor performance of
> > my code on the shared-memory machine is due to the
> > competition of different processes for the same memory
> > bus. Since the code is still MPI based, a lot of data
> > are moving around inside the memory. Is this a
> > reasonable explanation of what I observed?
> >
> > Thank you very much.
> >
> > Shi