PETSc runs slower on a shared memory machine than on a cluster

Barry Smith bsmith at mcs.anl.gov
Sat Feb 3 18:57:29 CST 2007


              Total Mflop/s
            Cluster   Shared memory

VecMAXPY      1793        1105
MatMult        815         536
MatSolve       531         339

The vector operations in VecMAXPY and the triangular solves in MatSolve are
memory bandwidth limited (the triangular solves extremely so). When all the
processors demand their needed memory bandwidth at once during the triangular
solves, performance suffers: 339 Mflop/s versus 531 in the distributed memory
case, where each processor has its own memory.
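
To make the bandwidth argument concrete: a sparse triangular solve moves
roughly 12 bytes per nonzero (an 8-byte value plus a 4-byte column index)
and does about 2 flops with them, so the achievable rate is on the order of
(sustained memory bandwidth)/6, i.e. a few hundred Mflop/s per process at a
few GB/s, which is the range seen in the tables below. A quick way to see how
much bandwidth each process actually gets when they all run together is a
STREAM-triad-style loop started on every MPI rank at once. The sketch below
is only an illustration (not code from this thread); the array size and
repeat count are arbitrary choices.

/* Rough illustration: a STREAM-triad-style loop run on all MPI ranks at once.
   On a shared-memory box the per-rank bandwidth it reports drops as more
   ranks are added, which is the effect limiting MatSolve/VecMAXPY above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  const int n    = 20000000;           /* three 160 MB arrays: far larger than cache */
  const int reps = 10;
  double *a = malloc(n * sizeof(double));
  double *b = malloc(n * sizeof(double));
  double *c = malloc(n * sizeof(double));
  double t0, t1, mbps;
  int i, r, rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  for (i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  MPI_Barrier(MPI_COMM_WORLD);         /* make all ranks hit memory together */
  t0 = MPI_Wtime();
  for (r = 0; r < reps; r++)
    for (i = 0; i < n; i++) a[i] = b[i] + 3.0 * c[i];  /* triad: 3 doubles moved, 2 flops */
  t1 = MPI_Wtime();

  mbps = (double)reps * 3.0 * n * sizeof(double) / (t1 - t0) / 1.0e6;
  printf("rank %d of %d: ~%.0f MB/s (a[0]=%g)\n", rank, size, mbps, a[0]);

  MPI_Finalize();
  free(a); free(b); free(c);
  return 0;
}

Running this with 1, 2, 4, ... ranks on each machine and comparing the
per-rank MB/s shows directly how much of the peak bandwidth each processor
keeps when all of them are busy.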

  Barry

On Sat, 3 Feb 2007, Shi Jin wrote:

> Thank you.
> I did the same runs again with -log_summary. Here is
> the part that I think is most important.
> On cluster:
> --- Event Stage 5: Projection
> Event                Count      Time (sec)     Flops/sec                          --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> [x]rhsLu              99 1.0 2.3875e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.9e+01  7  0  0  0  0  14  0  0  0  0     0
> VecMDot           133334 1.0 4.1386e+02 1.6 3.43e+08 1.6 0.0e+00 0.0e+00 1.3e+05 10 18  0  0 45  21 27  0  0 49   883
> VecNorm           137829 1.0 6.9839e+01 1.5 1.27e+08 1.5 0.0e+00 0.0e+00 1.4e+05  2  1  0  0 46   4  2  0  0 51   350
> VecScale          137928 1.0 5.5639e+00 1.1 5.79e+08 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  2197
> VecCopy             4495 1.0 8.4510e-01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet            142522 1.0 1.7712e+01 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
> VecAXPY             8990 1.0 9.9013e-01 1.1 4.34e+08 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1610
> VecMAXPY          137829 1.0 2.1687e+02 1.1 4.92e+08 1.1 0.0e+00 0.0e+00 0.0e+00  6 20  0  0  0  12 29  0  0  0  1793
> VecScatterBegin   137829 1.0 2.1816e+01 1.9 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00  0  0 91 74  0   1  0100100  0     0
> VecScatterEnd     137730 1.0 3.0302e+01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecNormalize      137829 1.0 7.6565e+01 1.4 1.68e+08 1.4 0.0e+00 0.0e+00 1.4e+05  2  2  0  0 46   4  3  0  0 51   479
> MatMult           137730 1.0 3.5652e+02 1.3 2.58e+08 1.2 8.3e+05 3.4e+04 0.0e+00  9 15 91 74  0  19 21100100  0   815
> MatSolve          137829 1.0 5.0916e+02 1.2 1.56e+08 1.2 0.0e+00 0.0e+00 0.0e+00 13 14  0  0  0  28 20  0  0  0   531
> MatGetRow        44110737 1.0 1.1846e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   7  0  0  0  0     0
> KSPGMRESOrthog    133334 1.0 6.0430e+02 1.3 3.87e+08 1.3 0.0e+00 0.0e+00 1.3e+05 15 37  0  0 45  32 54  0  0 49  1209
> KSPSolve              99 1.0 1.4336e+03 1.0 2.37e+08 1.0 8.3e+05 3.4e+04 2.7e+05 40 68 91 74 91  86100100100100   944
> PCSetUpOnBlocks       99 1.0 3.2687e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply           137829 1.0 5.3316e+02 1.2 1.50e+08 1.2 0.0e+00 0.0e+00 0.0e+00 14 14  0  0  0  30 20  0  0  0   507
> ---------------------------------------------------
> On the shared memory machine:
> --- Event Stage 5: Projection
> Event                Count      Time (sec)     Flops/sec                          --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> [x]rhsLu              99 1.0 2.0673e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 9.9e+01  5  0  0  0  0   9  0  0  0  0     0
> VecMDot           133334 1.0 7.0932e+02 2.1 2.70e+08 2.1 0.0e+00 0.0e+00 1.3e+05 11 18  0  0 45  22 27  0  0 49   515
> VecNorm           137829 1.0 1.2860e+02 7.0 3.32e+08 7.0 0.0e+00 0.0e+00 1.4e+05  2  1  0  0 46   3  2  0  0 51   190
> VecScale          137928 1.0 5.0018e+00 1.0 6.36e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  2444
> VecCopy             4495 1.0 1.4161e+00 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet            142522 1.0 1.9602e+01 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
> VecAXPY             8990 1.0 1.5128e+00 1.4 3.67e+08 1.4 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1054
> VecMAXPY          137829 1.0 3.5204e+02 1.4 3.82e+08 1.4 0.0e+00 0.0e+00 0.0e+00  7 20  0  0  0  13 29  0  0  0  1105
> VecScatterBegin   137829 1.0 1.4310e+01 2.2 0.00e+00 0.0 8.3e+05 3.4e+04 0.0e+00  0  0 91 74  0   0  0100100  0     0
> VecScatterEnd     137730 1.0 1.5035e+02 6.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   3  0  0  0  0     0
> VecNormalize      137829 1.0 1.3453e+02 5.6 3.80e+08 5.6 0.0e+00 0.0e+00 1.4e+05  2  2  0  0 46   3  3  0  0 51   272
> MatMult           137730 1.0 5.4179e+02 1.5 1.99e+08 1.4 8.3e+05 3.4e+04 0.0e+00 11 15 91 74  0  21 21100100  0   536
> MatSolve          137829 1.0 7.9682e+02 1.4 1.18e+08 1.4 0.0e+00 0.0e+00 0.0e+00 16 14  0  0  0  30 20  0  0  0   339
> MatGetRow        44110737 1.0 1.0296e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   5  0  0  0  0     0
> KSPGMRESOrthog    133334 1.0 9.4927e+02 1.4 2.75e+08 1.4 0.0e+00 0.0e+00 1.3e+05 18 37  0  0 45  34 54  0  0 49   770
> KSPSolve              99 1.0 2.0562e+03 1.0 1.65e+08 1.0 8.3e+05 3.4e+04 2.7e+05 47 68 91 74 91  91100100100100   658
> PCSetUpOnBlocks       99 1.0 3.3998e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> PCApply           137829 1.0 8.2326e+02 1.4 1.14e+08 1.4 0.0e+00 0.0e+00 0.0e+00 16 14  0  0  0  31 20  0  0  0   328
> 
> I do see that the cluster run is faster than the
> shared-memory case. However, I am not sure how to tell
> from the log that the memory subsystem is the reason for
> this behavior; I don't know what evidence to look for.
> Thanks again.
> 
> Shi
> --- Satish Balay <balay at mcs.anl.gov> wrote:
> 
> > There are two aspects to performance:
> > 
> > - MPI performance [while message passing]
> > - sequential performance of the numerical kernels
> > 
> > So it could be that the SMP box has better MPI performance.
> > This can be verified with -log_summary from both runs
> > [by looking at the VecScatter times].
> > 
> > The sequential numerical code, however, depends primarily
> > on the bandwidth between the CPU and the memory. On the SMP
> > box - depending upon how the memory subsystem is designed -
> > the effective memory bandwidth per CPU can be a small
> > fraction of the peak memory bandwidth [when all CPUs are
> > used].
> > 
> > So you'll have to look at the memory subsystem design of
> > each of these machines and compare the 'memory bandwidth
> > per CPU'. The performance from -log_summary - for example
> > in MatMult - will reflect this [including the above
> > communication overhead].
> > 
> > Satish
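
For reference, the named stage ("Projection") and the user event at the head
of the tables above come out of PETSc's logging interface. Below is a minimal
sketch of that kind of instrumentation; it is not the poster's code, the
surrounding program is hypothetical, error checking is omitted, and the
calling sequences follow the current PETSc documentation (older releases
order some of these arguments differently).

/* Sketch only: how a named log stage and a user event end up in the
   -log_summary tables.  The names mirror the log above; everything else
   is a placeholder. */
#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscLogStage projection;
  PetscLogEvent rhsLu;
  PetscClassId  classid;

  PetscInitialize(&argc, &argv, NULL, NULL);

  PetscClassIdRegister("User", &classid);
  PetscLogStageRegister("Projection", &projection);
  PetscLogEventRegister("rhsLu", classid, &rhsLu);

  PetscLogStagePush(projection);        /* everything below is charged to this stage */
  PetscLogEventBegin(rhsLu, 0, 0, 0, 0);
  /* ... form the right-hand side, call KSPSolve(), etc. ... */
  PetscLogEventEnd(rhsLu, 0, 0, 0, 0);
  PetscLogStagePop();

  PetscFinalize();                      /* run with -log_summary to get the tables */
  return 0;
}

With a stage like this in place, the per-stage VecScatter, MatMult, and
MatSolve lines from the two machines can be compared directly, as suggested
above.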
> > 
> > On Fri, 2 Feb 2007, Shi Jin wrote:
> > 
> > > Hi there,
> > > 
> > > I am fairly new to PETSc but already have 5 years of MPI
> > > programming experience. I recently took on a project of
> > > analyzing a finite element code written in C with PETSc.
> > > I found that on a shared-memory machine (60 GB RAM,
> > > 16 CPUs), the code runs around 4 times slower than on a
> > > distributed-memory cluster (4 GB RAM, 4 CPUs per node),
> > > although they yield identical results.
> > > There are 1.6 million finite elements in the problem, so
> > > it is a fairly large calculation. The total memory used
> > > is 3 GB x 16 = 48 GB.
> > > 
> > > Both systems run Linux, and the same code is compiled
> > > against the same versions of MPICH-2 and PETSc.
> > > 
> > > The shared-memory machine is actually a little faster
> > > than the cluster machines for single-process runs.
> > > 
> > > I am surprised at this result, since we usually tend to
> > > think that shared memory would be much faster, because
> > > in-memory operations are much faster than network
> > > communication.
> > > 
> > > However, I read the PETSc FAQ and found that "the speed
> > > of sparse matrix computations is almost totally
> > > determined by the speed of the memory, not the speed of
> > > the CPU".
> > > This makes me wonder whether the poor performance of my
> > > code on the shared-memory machine is due to competition
> > > among the different processes for the same memory bus.
> > > Since the code is still MPI based, a lot of data is
> > > moving around in memory. Is this a reasonable explanation
> > > of what I observed?
> > > 
> > > Thank you very much.
> > > 
> > > Shi



