PETSc runs slower on a shared memory machine than on a cluster

Shi Jin jinzishuai at
Mon Feb 5 17:23:24 CST 2007

Hi there,

I have made some new progress on the issue of SMP
performance. My shared-memory machine has 8 dual-core
Opterons, and I think the two cores on a single CPU
chip share the memory bandwidth. Therefore, if I can
avoid using both cores on the same chip, I should get
some performance improvement. Indeed, I am able to do
this with the Linux command taskset.
Here is what I did:
petscmpirun -n 8 taskset -c 0,2,4,6,8,10,12,14 ../spAF
This way, I specifically ask the processes to run
on the first core of each CPU.
By doing this, my performance is doubled compared with
the plain petscmpirun -n 8 ../spAF
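
A minimal sketch of how the core list in the taskset command above could be generated rather than typed by hand. It assumes cores are numbered chip by chip (0,1 on the first chip, 2,3 on the second, and so on), which matches the list used above but is not guaranteed on every machine; the numbering can be checked in /proc/cpuinfo. The function names are hypothetical and only for illustration.

```python
def first_core_per_chip(n_chips, cores_per_chip=2):
    """Return the lowest-numbered core id on each chip.

    Assumption: cores are numbered chip by chip, so chip k owns
    core ids k*cores_per_chip .. k*cores_per_chip + cores_per_chip - 1.
    """
    return [chip * cores_per_chip for chip in range(n_chips)]


def taskset_command(n_procs, n_chips, program, launcher="petscmpirun"):
    """Build the launcher command as a list of arguments."""
    cores = first_core_per_chip(n_chips)
    if n_procs > len(cores):
        # More processes than chips: some chip would host two processes.
        raise ValueError("more processes than chips; cores would be shared")
    core_list = ",".join(str(c) for c in cores)
    return [launcher, "-n", str(n_procs), "taskset", "-c", core_list, program]


if __name__ == "__main__":
    # Reproduces the command used above for 8 processes on 8 chips.
    print(" ".join(taskset_command(8, 8, "../spAF")))
```

With 8 processes and 8 chips this rebuilds the same command line as above; asking for 16 processes raises an error, since every chip would then have to host two processes.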

So this test shows that we do suffer from competition
for resources among multiple processes, especially
when we use 16 processes.
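
Note that taskset -c with a list of cores allows each process to run on any core in that list, so the scheduler still decides the final placement. A minimal sketch of pinning each process to exactly one core is given here, assuming Linux and Python 3.3 or later (os.sched_setaffinity is Linux-only). How the rank is obtained is left to the caller, and the function names are hypothetical.

```python
import os


def core_for_rank(rank, cores):
    """Pick one core for a given rank, wrapping around if ranks exceed cores."""
    ordered = sorted(cores)
    return ordered[rank % len(ordered)]


def pin_current_process(rank, cores):
    """Restrict the calling process to a single core chosen from `cores`.

    Returns the chosen core id. On platforms without sched_setaffinity
    nothing is changed and the id is returned for information only.
    """
    core = core_for_rank(rank, cores)
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})  # pid 0 means the calling process
    return core


if __name__ == "__main__":
    # Show the rank-to-core mapping for the first core of each of 8 chips
    # (assumed numbering: even ids are the first core on each chip).
    even_cores = [0, 2, 4, 6, 8, 10, 12, 14]
    for rank in range(8):
        print(rank, core_for_rank(rank, even_cores))
```

With 16 ranks and only the 8 even-numbered cores, the mapping wraps around and two ranks share a core, which is why a 16-process run has to use both cores on every chip.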

However, I should point out that even with the help
of taskset, the shared-memory performance is still 30%
lower than that on the cluster.

I am not sure whether this problem exists specifically
for the AMD machines or whether it applies to any
shared-memory machine.


--- Shi Jin <jinzishuai at> wrote:

> Hi there,
> I am fairly new to PETSc but already have 5 years of
> MPI programming experience. I recently took on a
> project of analyzing a finite element code written in
> C with PETSc.
> I found out that on a shared-memory machine (60GB RAM,
> 16 CPUs), the code runs around 4 times slower than on
> a distributed-memory cluster (4GB RAM, 4 CPUs/node),
> although they yield identical results.
> There are 1.6 million finite elements in the problem,
> so it is a fairly large calculation. The total memory
> used is 3GB x 16 = 48GB.
> Both systems run Linux, and the same code is compiled
> against the same versions of MPICH-2 and PETSc.
> The shared-memory machine is actually a little faster
> than the cluster machines in terms of single-process
> runs.
> I am surprised at this result, since we usually tend
> to think that shared memory would be much faster
> because in-memory operations are much faster than
> network communication.
> However, I read the PETSc FAQ and found that "the
> speed of sparse matrix computations is almost totally
> determined by the speed of the memory, not the speed
> of the CPU".
> This makes me wonder whether the poor performance of
> my code on a shared-memory machine is due to
> competition among different processes for the same
> memory bus. Since the code is still MPI based, a lot
> of data is moving around inside the memory. Is this a
> reasonable explanation of what I observed?
> Thank you very much.
> Shi

More information about the petsc-users mailing list