[MPICH] What is the most efficient MPICH-2 communication device for a dual-core dual-processor Linux PC?
Gus Correa
gus at ldeo.columbia.edu
Fri Sep 28 17:27:22 CDT 2007
Dear MPICH experts,
What is the most efficient MPICH-2 communication device for a
dual-core dual-processor Linux PC (i.e. not a cluster, not distributed
memory)?
Among the four alternatives ch3:sock, ch3:ssm, ch3:shm, and ch3:nemesis,
I presume ssm and shm behave the same way on an SMP machine like this,
but beyond that I don't know which of the four devices would be best.
I don't need threading or any MPI-2-specific features such as dynamic
process creation,
so I presume I can choose any of the four devices.
I am concerned only with performance,
i.e., taking the best possible advantage of the multicore /
multiprocessor hardware.
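(For reference, my understanding is that the channel is selected when
MPICH2 itself is configured and built, along these lines; the install
prefix below is just a placeholder, and I have not tested this exact
sequence:)

```shell
# Sketch: rebuilding MPICH2 with a different ch3 channel.
# The device is chosen at configure time in MPICH2 1.0.x;
# the --prefix path here is a placeholder.
./configure --prefix=/usr/local/mpich2-nemesis --with-device=ch3:nemesis
make
make install
```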
This question has probably been asked before,
so I would be happy with just a pointer
to the relevant thread in the mpich-discuss list archive that contains
the answer or a simple benchmark.
There is more information about my problem below, if you need it.
Many thanks,
Gus Correa
*************
More info:
0) I am using ch3:sock, just because it is the default!
1) The computer is a 64-bit dual-core dual-processor 3GHz Intel Xeon
standalone PC with 4GB of memory.
2) The OS is Linux 2.6.22.5-76.fc7 (Fedora Core 7).
3) The compilers are gcc and Intel Fortran 10.0.023.
4) The version of MPICH-2 is 1.0.5p4, but I can upgrade to the latest
release, if this is required.
5) Compilation was done in 64-bit mode.
6) The program I am running is AM2.1 from GFDL (see
http://www.gfdl.noaa.gov/fms/),
written in Fortran 90/95.
It is a typical grid and domain decomposition problem, although quite a
complex code.
I don't know whether the program scales well with the number of
processors, BTW, but it is not dominated by I/O; the computational
effort is in number crunching and communication.
7) The execution times (wall clock) I get for one model-month simulation
are:
- 1 process : 224 minutes
- 2 processes : 122 minutes
- 3 processes: 108 minutes
(FYI, due to restrictions in the program itself, I cannot use 4 processes.
The program requires that the number of processes divides 90,
which is the number of points in one of the grid directions.)
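(A quick sanity check of which process counts this 4-core box could
actually use, given the divides-90 restriction:)

```shell
# Which process counts up to 4 divide 90 evenly?
# Only those are usable by the model on this machine.
for n in 1 2 3 4; do
  if [ $((90 % n)) -eq 0 ]; then
    echo "$n divides 90"
  fi
done
```

So 4 processes would not divide the 90-point grid direction evenly.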
Note that the execution time ratio between 2 and 1 process is close to
1/2, which made me happy. :)
However, with 3 processes the run takes almost the same time as with 2,
so things become disappointingly slow. :(
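(For reference, here are the speedups and parallel efficiencies implied
by the wall-clock times above, as a quick awk calculation:)

```shell
# Back-of-the-envelope speedup and efficiency from the times above:
# 224, 122, and 108 minutes for 1, 2, and 3 processes.
awk 'BEGIN {
  t1 = 224; t2 = 122; t3 = 108
  printf "speedup(2)=%.2f efficiency=%.0f%%\n", t1/t2, 100*t1/(2*t2)
  printf "speedup(3)=%.2f efficiency=%.0f%%\n", t1/t3, 100*t1/(3*t3)
}'
```

So the efficiency drops from about 92% at 2 processes to about 69% at 3.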
What is the reason for this slow down?
Should I use another device, instead of ch3:sock? Which comm device is
the best?
Or does it slow down because the multicore feature kicks in when I move
from 2 to 3 processes,
and perhaps multicore is not really so fast? (In which case changing
the communication device would not speed things up.)
Thank you,
Gus Correa
--
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
Oceanography Bldg., Rm. 103-D, ph. (845) 365-8911, fax (845) 365-8736
---------------------------------------------------------------------