PETSc runs slower on a shared memory machine than on a cluster

Satish Balay balay at mcs.anl.gov
Fri Feb 2 15:55:02 CST 2007


There are two aspects to performance:

- MPI performance [time spent in message passing]
- sequential performance of the numerical kernels

So it could be that the SMP box has better MPI performance. This can
be verified by running with -log_summary on both machines [and
looking at the VecScatter times].
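
For example [mpiexec and ./myapp here are placeholders for whatever
launcher and binary you actually use]:

    mpiexec -n 16 ./myapp -log_summary

Do the same run on the cluster; the relevant events to compare in the
two logs are VecScatterBegin and VecScatterEnd.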

However, the performance of the sequential numerical kernels depends
primarily on the bandwidth between the CPUs and memory. On the SMP
box, depending on how the memory subsystem is designed, the effective
memory bandwidth per CPU can be a small fraction of the peak memory
bandwidth [when all CPUs are in use].
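
One way to check this directly is with a simple bandwidth test [a
minimal STREAM-style triad sketch; the file name, array sizes, and
iteration counts below are just illustrative]. Run one copy per CPU
simultaneously [e.g. launch 16 copies with mpiexec] and compare the
per-copy bandwidth against a single-copy run:

    /* stream_triad.c: estimate sustained memory bandwidth with a
     * STREAM-style triad. Run one copy per CPU at the same time and
     * watch how much the per-copy number drops. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define N (10 * 1000 * 1000)   /* 10M doubles per array, ~240MB total */
    #define NTRIES 10

    int main(void)
    {
      double *a = malloc(N * sizeof(double));
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));
      struct timeval t0, t1;
      double sec, best = 1e30;
      int i, k;

      if (!a || !b || !c) { fprintf(stderr, "malloc failed\n"); return 1; }
      for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

      for (k = 0; k < NTRIES; k++) {
        gettimeofday(&t0, NULL);
        for (i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];  /* triad */
        gettimeofday(&t1, NULL);
        sec = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
        if (sec < best) best = sec;
      }

      /* triad touches 3 arrays of N doubles: 2 loads + 1 store per element;
         printing a[0] keeps the compiler from discarding the stores */
      printf("best triad bandwidth: %.1f MB/s (check: a[0]=%g)\n",
             3.0 * N * sizeof(double) / best / 1e6, a[0]);
      free(a); free(b); free(c);
      return 0;
    }

If the per-copy bandwidth collapses when all 16 CPUs run at once,
that is your answer.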

So you'll have to look at the memory subsystem design of each of
these machines and compare the memory bandwidth per CPU. The
performance reported by -log_summary, for example in MatMult, will
reflect this [including the communication overhead discussed above].
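
As a rough back-of-the-envelope [illustrative numbers, not
measurements from your machines]: a sparse AIJ MatMult does 2 flops
per stored nonzero [one multiply, one add] but must read about 12
bytes per nonzero [an 8-byte double plus a 4-byte column index], so

    flops per byte    ~ 2/12
    peak MatMult rate ~ (2/12) * sustained bandwidth
    e.g. 1 GB/s per CPU  ->  ~170 Mflops/s per CPU

no matter how fast the CPUs themselves are.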

Satish

On Fri, 2 Feb 2007, Shi Jin wrote:

> Hi there,
> 
> I am fairly new to PETSc but have 5 years of MPI
> programming experience. I recently took on a project
> analyzing a finite element code written in C with
> PETSc.
> I found out that on a shared-memory machine (60GB RAM,
> 16 CPUs), the code runs around 4 times slower than
> on a distributed-memory cluster (4GB RAM, 4 CPUs/node),
> although they yield identical results.
> There are 1.6 million finite elements in the problem, so
> it is a fairly large calculation. The total memory
> used is 3GB x 16 = 48GB.
> 
> Both systems run Linux, and the same code is compiled
> against the same versions of MPICH-2 and PETSc.
>  
> The shared-memory machine is actually a little faster
> than the cluster nodes in terms of single-process
> runs.
> 
> I am surprised by this result, since we usually tend to
> think that shared memory would be much faster:
> in-memory communication is much faster than
> network communication.
> 
> However, I read the PETSc FAQ and found that "the
> speed of sparse matrix computations is almost totally
> determined by the speed of the memory, not the speed
> of the CPU". 
> This makes me wonder whether the poor performance of
> my code on the shared-memory machine is due to
> different processes competing for the same memory
> bus. Since the code is still MPI-based, a lot of data
> is moving around in memory. Is this a
> reasonable explanation of what I observed?
> 
> Thank you very much.
> 
> Shi
> 