[MPICH] What is the most efficient MPICH-2 communication device for a dual-core dual-processor Linux PC?
Gus Correa
gus at ldeo.columbia.edu
Fri Sep 28 17:27:22 CDT 2007
Dear MPICH experts,
What is the most efficient MPICH-2 communication device for a
dual-core dual-processor Linux PC (i.e. not a cluster, not distributed
memory)?
Among the four alternatives ch3:sock, ch3:ssm, ch3:shm, and ch3:nemesis,
I presume ssm and shm behave the same way on an SMP machine like this,
but beyond that I don't know which of the four devices would be best.
I don't need threading or any MPI-2-specific features such as dynamic
process creation,
so I presume I can choose any of the four devices.
I am concerned only with performance,
i.e., taking the best possible advantage of the multicore /
multiprocessor hardware.
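(For reference, my understanding is that the channel is selected when
MPICH2 itself is configured and built, along these lines; the install
prefix below is just a placeholder, and I have not tested this exact
sequence:)

```shell
# Sketch: rebuilding MPICH2 with a different ch3 channel.
# The device is chosen at configure time in MPICH2 1.0.x;
# the --prefix path here is a placeholder.
./configure --prefix=/usr/local/mpich2-nemesis --with-device=ch3:nemesis
make
make install
```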
This question has probably been asked before,
so I would be happy with just a pointer
to the relevant thread in the mpich-discuss list archive that contains
the answer or a simple benchmark.
There is more information about my problem below, if you need it.
Many thanks,
Gus Correa
*************
More info:
0) I am using ch3:sock, just because it is the default!
1) The computer is a 64-bit dual-core dual-processor 3GHz Intel Xeon
standalone PC with 4GB of memory.
2) The OS is Linux 2.6.22.5-76.fc7 (Fedora Core 7).
3) The compilers are gcc and Intel Fortran 10.0.023.
4) The version of MPICH-2 is 1.0.5p4, but I can upgrade to the latest
release, if this is required.
5) Compilation was done in 64-bit mode.
6) The program I am running is AM2.1 from GFDL (see
http://www.gfdl.noaa.gov/fms/),
written in Fortran 90/95.
It is a typical grid and domain decomposition problem, although quite a
complex code.
I don't know whether the program scales well with the number of
processors, BTW, but it is not dominated by I/O; the computational
effort is in number crunching and communication.
7) The execution times (wall clock) I get for one model-month simulation
are:
- 1 process : 224 minutes
- 2 processes : 122 minutes
- 3 processes: 108 minutes
(FYI, due to restrictions in the program itself, I cannot use 4 processes.
The program requires that the number of processes divides 90,
which is the number of points in one of the grid directions.)
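(A quick sanity check of which process counts this 4-core box could
actually use, given the divides-90 restriction:)

```shell
# Which process counts up to 4 divide 90 evenly?
# Only those are usable by the model on this machine.
for n in 1 2 3 4; do
  if [ $((90 % n)) -eq 0 ]; then
    echo "$n divides 90"
  fi
done
```

So 4 processes would not divide the 90-point grid direction evenly.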
Note that the execution time ratio between 2 and 1 process is close to
1/2, which made me happy. :)
However, with 3 processes the run takes almost the same time as with 2,
so things become disappointingly slow. :(
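(For reference, here are the speedups and parallel efficiencies implied
by the wall-clock times above, as a quick awk calculation:)

```shell
# Back-of-the-envelope speedup and efficiency from the times above:
# 224, 122, and 108 minutes for 1, 2, and 3 processes.
awk 'BEGIN {
  t1 = 224; t2 = 122; t3 = 108
  printf "speedup(2)=%.2f efficiency=%.0f%%\n", t1/t2, 100*t1/(2*t2)
  printf "speedup(3)=%.2f efficiency=%.0f%%\n", t1/t3, 100*t1/(3*t3)
}'
```

So the efficiency drops from about 92% at 2 processes to about 69% at 3.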
What is the reason for this slow down?
Should I use another device, instead of ch3:sock? Which comm device is
the best?
Or does it slow down because the multicore feature kicks in when I move
from 2 to 3 processes,
and perhaps multicore is not really so fast? (In which case changing
the communication device would not speed things up.)
Thank you,
Gus Correa
--
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
Oceanography Bldg., Rm. 103-D, ph. (845) 365-8911, fax (845) 365-8736
---------------------------------------------------------------------